Enhancing Profitability with AI: Sentiment Analysis for Sprinter Sportswear¶
👋 Introduction¶
In this notebook, we aim to enhance Sprinter's profitability. We will do that by training a model that combines sentiment analysis with a grading algorithm to determine whether a product is well received and likely to be profitable for the company. For this purpose, we will use a Twitter sentiment analysis dataset, an Amazon reviews dataset, and a Reddit dataset to train a model to distinguish positive from negative text.
Table of Contents¶
📖 Domain Understanding¶
- Background
- Research
- TICT Quick Scans
- Analyzing Sentiment and Language with NLP
📦 Data Provisioning¶
- 📋 Data Requirements
- 🗂️ Data Collection
- 📊 Data Understanding
- 🛠️ Data Preparation
🔮 Predictions¶
- 🧹 Preprocessing
- 🧬 Modelling
- 🧐 Evaluation
🥒 Saving the Models Using Pickle¶
🔧 Functions - Text Prep and Analysis¶
- Text Preparation and Preprocessing
- Sentiment Analysis Task
🎬 Demonstration¶
📝 Feedback¶
🤝 Conclusion¶
📖 Domain Understanding¶
Background¶
In today's market, the abundance of sportswear shops, both physical and online, makes it challenging to determine the true value of a product. This project aims to address this issue by creating an AI model for Sprinter, a prominent sportswear company with a substantial online and physical presence in Europe.
The AI model will employ sentiment analysis to evaluate product reviews, distinguishing between favorable and unfavorable feedback. Additionally, a grading algorithm will assign not only a percentage grade but also a corresponding emoticon to each product, helping Sprinter make data-driven decisions about the products in their inventory.
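The grading step described above can be sketched in a few lines. This is a minimal illustration, not Sprinter's actual business logic: the function name, the percentage thresholds, and the emoticon choices are all assumptions made for this example.

```python
# Hypothetical sketch of the grading algorithm: the share of positive
# reviews becomes a percentage grade, which maps to an emoticon.
# Thresholds and emoticons are illustrative assumptions.
def grade_product(sentiments):
    """sentiments: list of labels such as 'Positive' / 'Negative' / 'Neutral'."""
    if not sentiments:
        return 0.0, "🤷"
    positive = sum(1 for s in sentiments if s == "Positive")
    grade = round(100 * positive / len(sentiments), 1)
    if grade >= 75:
        emoticon = "😀"
    elif grade >= 50:
        emoticon = "🙂"
    elif grade >= 25:
        emoticon = "😐"
    else:
        emoticon = "☹️"
    return grade, emoticon

print(grade_product(["Positive", "Positive", "Negative", "Positive"]))  # (75.0, '😀')
```

A product with three positive reviews out of four would thus receive a 75.0% grade and a smiling emoticon.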
Research¶
Research Questions¶
Main question:
How do various factors shape consumer behavior in online shopping for sportswear apparel?¶
Sub-questions:
Research methods¶
| Research Method | Description |
|---|---|
| Interview | This method involves engaging with individuals or groups to gather information and insights. Interviews are great for understanding experiences and in-depth perspectives of a topic. |
| Interview with Expert | This method consists of talking with individuals who possess expertise to the related topic. They are valuable for gaining access to high-quality information, technical details and unique viewpoints. |
| Literature Review | This method involves reviewing existing articles, documents and reports. |
| Available Product Analysis | This method consists of analyzing and evaluating existing products or services. It helps understand what is already available in the market. |
My investigation into this area produced a number of significant findings. First, the sportswear market is extremely competitive, with many options available to consumers both online and offline. According to Wang (2021), consumers spend more time when they buy sports products online. Furthermore, online shoppers can only imagine products based on the information provided on the website, so customer perceptions and purchase decisions are significantly influenced by that information. Customers' decisions are also frequently shaped by how good and how affordable products are perceived to be. Finally, reviews and ratings matter to online buyers: as Lackermair, Kailer and Kanmaz state, the user rating of a product is often a first indicator in the customer's decision-making process, and once customers enter the product page, they depend on user reviews to find out whether the product matches their requirements.
Interviews¶
Interview questions for the stakeholder:
- As a manager at Sprinter Sportswear, could you provide an overview of the company's current strategies and objectives in the sportswear retail industry?
- How does customer feedback currently influence decision making in Sprinter particularly in terms of product selection and marketing strategies?
- How do you envision the proposed AI model for sentiment analysis and product grading aligning with Sprinter’s objectives and strategies?
- What are your expectations for the impact of this project on Sprinter's customer service and profitability?
Interview questions for the customer:
- Do you often purchase products from online stores?
- Do you practice a sport or exercise?
- As a customer who shops at Sprinter, what factors are most important to you when choosing sportswear products?
- When you are shopping online do you rely on the reviews of a certain product?
Based on the interview with the customer, I understood that people frequently look to reviews when making purchases online; reviews can even be crucial in the decision whether to buy a product or not. From the interview with the manager, I learned that currently the reviews on the website do not influence inventory decisions. However, they aim to change that, expecting that this AI model will help them gather useful information about the products and their quality, improving customer satisfaction and sales.
TICT Quick Scan¶
TICT Quick Scan What-If Situation¶
The next TICT Quick Scan is somewhat different. I went over all of the questions again and answered them as if the sentiment analysis model were already working perfectly and the company were using it every day. Because we imagine the model as already implemented, some of the points are not applicable at this end phase.
Analyzing Sentiment and Language with NLP¶
Natural Language Processing (NLP) has revolutionized the way we analyze sentiment and language in various contexts. In this chapter, we will delve into the world of NLP, its applications, and the profound impact it has on businesses and decision-making processes. The chapter is divided into two sections: the pros and the cons of sentiment analysis with NLP.
Pros of Analyzing Sentiment with NLP¶
- Enhanced Customer Experience
Businesses, such as Sprinter, can use NLP-driven sentiment analysis to understand customer feedback better. By identifying both positive and negative sentiments, they can tailor their products and services to meet customer expectations and enhance overall satisfaction.
- Uncovering Trends and Patterns
NLP enables the identification of trends and patterns within textual data, which can inform decision-making processes.
- Automation and Efficiency
NLP allows for the automation of sentiment analysis, which significantly improves efficiency. It can process large volumes of text data quickly, making it a valuable tool for businesses.
Cons of Analyzing Sentiment with NLP¶
- Data Bias
NLP models can inherit biases from the data they are trained on. This can result in the amplification of societal biases in sentiment analysis, potentially leading to inaccurate assessments.
- Sentiment Polarity
NLP models often categorize sentiment into binary positive/negative labels, which oversimplifies the complexity of human emotions.
- Privacy Concerns
The use of NLP can raise concerns, particularly in the context of privacy and data security. The gathering and processing of personal information may breach people's privacy or put them at risk of identity theft or threats.
Balancing the Pros and Cons¶
Having explored the advantages and disadvantages of NLP for sentiment analysis, it becomes clear that a balance must be struck to use this technology effectively.
- Contextual Understanding
One of the key aspects of balancing the pros and cons is improving the contextual understanding of models. It is essential to refine models so that they better grasp context. This can be achieved by training multiple models on diverse datasets.
- Addressing Bias
Mitigating data bias is crucial for maintaining balance. It is vital to implement practices that detect and reduce bias, ensuring that the sentiment results remain fair.
- Transparency and Accountability
Maintaining transparency and accountability is extremely important. Clearly documenting the methodology, data sources, provisioning and pre-processing steps is essential for understanding the results of the models and their reliability.
After completing the domain understanding, we continue with the next phase, Data Provisioning. First, we are going to import the required libraries.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import style
import re
from wordcloud import WordCloud
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
from nltk.corpus import stopwords
import nltk
from collections import Counter
import pickle
import json
📦 Data provisioning¶
In this part, we'll look at the fundamentals of data provisioning: the crucial process of making data accessible, usable, and ready for analysis. We'll cover data sourcing, transformation, integration, quality, and security, giving us the knowledge and resources needed to guarantee data reliability and accessibility and pave the way for effective data-driven decisions. Data Provisioning is phase 2 of the AI Project methodology that we are following; phase 1, the proposal, is completed. The four main parts of Data Provisioning are Data Requirements, Data Collection, Data Understanding and Data Preparation.
📋 Data Requirements¶
The Data Requirements chapter is a crucial phase in the Data Provisioning process. In this chapter, I am going to define the data needs for conducting sentiment analysis for Sprinter. By outlining these requirements, we lay the foundation for data collection, which is the next step in our process.
Data Elements¶
- Text Data (Reviews, tweets, text)
- Data Type: Text
- Example: Product reviews
- Length: Variable, typically ranging from a few words to a paragraph
- Language: English
- Scope: The "Text Data" element encompasses a diverse range of textual content with customer reviews.
- Quality Assurance: Strict quality measures are in place in order to prevent errors and noise within the text data.
- Relation to the Target Variable: Text Data is the primary input for sentiment analysis, and the sentiment labels (Positive, Negative, Neutral) derived from this text data directly relate to the project's target variable. The goal is to accurately predict and categorize sentiment based on the text data.
- Relevance to the Project: The "Text Data" element is central to the project's objectives. It serves as the foundation of the sentiment analysis task.
- Categorical Data
- Data Type: Categorical
- Categories: Positive, Negative, Neutral (for sentiment labels)
- Example: Positive, Negative, Neutral
Data Volume¶
Estimating the desired amount of data required thorough research of the sentiment analysis task. After completing this, I concluded that, although many public datasets with millions of rows are available, for my project the data volume should not exceed 100,000 rows, with a minimum of 10,000. The volume will vary depending on how many datasets are included, but it will not go over the maximum. The reason is to prevent overfitting, where the model adapts so closely to the training data that it performs poorly in real-world situations. Based on my research, a data volume between 10,000 and 100,000 rows promises robust sentiment analysis.
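The volume window above can be enforced programmatically. The following is a minimal sketch under assumed names: `enforce_volume`, the random seed, and the column name are all my own choices for illustration.

```python
import pandas as pd

# Illustrative guard for the 10,000-100,000 row window described above.
MIN_ROWS, MAX_ROWS = 10_000, 100_000

def enforce_volume(df, seed=42):
    """Downsample datasets above the cap; reject datasets below the floor."""
    if len(df) > MAX_ROWS:
        df = df.sample(n=MAX_ROWS, random_state=seed)
    if len(df) < MIN_ROWS:
        raise ValueError(f"Only {len(df)} rows; at least {MIN_ROWS} required.")
    return df
```

For example, a 120,000-row dataframe would be sampled down to exactly 100,000 rows, while a 12,000-row one would pass through unchanged.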
Data Quality Standards¶
My criteria for data quality are extremely high. Accuracy, completeness, consistency, and relevance are vital for this project. To maintain accuracy, it is imperative that the sentiment labels assigned to the text data are precise, and I will implement different techniques to ensure that. Completeness ensures that no critical information is missing. Consistency in labeling and data formatting across the dataset enhances the analysis process. Moreover, data relevance is essential so that the collected data remains aligned with the project's objectives. By sticking to these standards, meaningful sentiment analysis results can be expected.
Ensuring Data Quality for Sprinter Sportswear Sentiment Analysis
Achieving high data quality is a fundamental aspect of the sentiment analysis project for Sprinter Sportswear. Here's a detailed breakdown of the strategies and techniques I will employ to maintain the desired data quality standards:
Accuracy in Sentiment Labels:
To ensure the accuracy of the sentiment labels assigned to the text data, I plan to use different techniques to determine the polarity (sentiment) of the text. I will use trustworthy sources and libraries that provide correct annotation of the text in the datasets.
Completeness of Information:
Achieving completeness involves thorough data collection to minimize the risk of missing critical information. Multiple public datasets will be used, ensuring a wide range of text data. Additionally, there will be multiple checks for missing data or any anomaly that could arise before and after the preparation of the data.
Consistency in Data Formatting:
Consistency in data formatting is crucial for the analysis process. Clear guidelines will be established for the format of the text data and they will be strictly followed. Consistency is extremely important to prevent biases and make the data standardized. There will be multiple functions and techniques involved in the formatting of the data. Those functions and techniques will be chosen based on the requirements that we have, the datasets and their origin and proper research of the domain and the sentiment analysis task.
Relevance to Project Objectives:
To ensure data relevance, my data collection strategy is designed to align closely with the context of sportswear reviews. This involves gathering information that is directly tied to customer reviews for various products. The aim is to capture sentiments surrounding product performance, quality, and user experience, contributing directly to the understanding of customer preferences and sentiments.
While not explicitly linked to customer reviews or sportswear, my approach also considers broader sentiment trends. This allows us to capture diverse perspectives across different topics and industries. The inclusion of a variety of sentiments enhances the overall sentiment analysis.
Ethical and Legal Aspects¶
Ethical and Legal Aspects While Collecting the Data¶
Ensuring that the collected data meets ethical and legal standards is extremely important. Respecting user privacy and obtaining permission for using customer reviews, if needed, is crucial. For this project, public datasets are going to be used, where data protection regulations are not violated.
Ethical and Legal Aspects for Sprinter¶
Informed Consent
As Sprinter is going to use the sentiment analysis model for analyzing the sentiment of product reviews that are written by users, permission will be sought to uphold ethical standards. The users should be informed about the intention to use their reviews for sentiment analysis. Explicit consent should be obtained from the users, emphasizing the transparency about the purpose.
Handling Personal Information
For this project, I recognize that the reviews may contain some personal information. The company is going to uphold the privacy and confidentiality of the individuals who provided that personal information. The reviews are going to be used only for the analysis of the product. Although not part of the scope of this project, data security is vital in this aspect: data breaches and unauthorized access to sensitive information should be prevented.
Transparency and Fairness
The project will be conducted with transparency and fairness, providing clear information on the methodology used. The company will have access to available documentation about the project, having all the information that is needed to use the model safely without compromising ethical or legal aspects. Furthermore, transparency with the user is key for the approach. Users will be fully aware of how their data is being utilized and the potential impact on the decision-making.
Impact on Decision-Making
Importantly, the project recognizes the ethical consequences of sentiment analysis, especially its potential impact on Sprinter's decisions or sales. Steps will be taken to minimize the unintended influence, such as conducting the whole project with transparency and fairness in mind.
Data Dictionary¶
Text Data (Reviews, tweets, text)
- Data Element Name: Text
- Data Type: Text
- Description: The original text data, such as customer reviews, tweets, before sentiment analysis.
- Source: Raw text data from multiple public datasets
- Quality Standards:
- Language: English
- Accuracy: The text should accurately represent the content of the reviews or tweets.
- Completeness: The text should not be missing critical information and should be intact.
- Consistency: The text should follow consistent formatting and conventions.
- Relevance: Ensure that the collected text data remains relevant to the project's objectives.
TextBlob Results
Data Element Name: Polarity
- Data Type: Numerical
- Description: Polarity score determined using TextBlob. Ranges from -1 (negative) to 1 (positive).
- Source: Output from TextBlob
Data Element Name: Sentiment
- Data Type: Categorical (Positive, Negative, Neutral)
- Description: Sentiment labels determined using the values of the Polarity provided by TextBlob.
- Source: Output from TextBlob
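The derivation of the categorical Sentiment label from the numerical Polarity can be sketched as follows. Treating exactly 0 as Neutral is an assumption for this sketch; the cutoffs used in the notebook's actual pipeline may differ.

```python
# Sketch of mapping TextBlob's polarity score (-1 to 1) onto the
# categorical sentiment labels described in this data dictionary.
# The cutoff choices (strictly above/below 0) are assumptions.
def polarity_to_sentiment(polarity):
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"
```

In the notebook itself, `TextBlob(text).sentiment.polarity` would supply the score, e.g. `polarity_to_sentiment(TextBlob(review).sentiment.polarity)`.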
VADER Results
VADER is a widely recognized and established sentiment analysis tool, known for its consistent and dependable results. Its track record in accurately assessing sentiment makes it a trusted choice for deriving sentiment scores, ensuring the reliability of the analysis outcomes.
Data Element Name: pos
- Data Type: Numerical
- Description: Positive score determined using VADER.
- Source: Output from VADER
Data Element Name: neg
- Data Type: Numerical
- Description: Negative score determined using VADER.
- Source: Output from VADER
Data Element Name: neu
- Data Type: Numerical
- Description: Neutral score determined using VADER.
- Source: Output from VADER
Data Element Name: compound
- Data Type: Numerical
- Description: Compound score determined using VADER, by using the pos, neg, and neu scores.
- Source: Output from VADER
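VADER's compound score is commonly thresholded at ±0.05 to obtain discrete labels; the sketch below uses those conventional cutoffs, but treating them as fixed is an assumption for this project rather than a requirement.

```python
# Conventional thresholds on VADER's compound score for deriving labels.
# The ±0.05 cutoffs follow common VADER usage; treat them as assumptions.
def compound_to_label(compound):
    if compound >= 0.05:
        return "Positive"
    if compound <= -0.05:
        return "Negative"
    return "Neutral"
```

With NLTK, `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` would supply the score (after downloading the `vader_lexicon` data).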
🗂️ Data Collection¶
Data collection is a foundational step in an Artificial Intelligence (AI) project's data provisioning process. The aim of data collection is to acquire a representative and informative dataset that enables the ML model to understand patterns, make accurate predictions, or to perform other tasks as required by the project's objectives. Let's delve into the activities and aims of the data collection step.
Having already defined our data requirements, we know what we are searching for. However, it is important to note that we may not find exactly that data, and we will need to make changes and improvements. With that being said, let's dive into the data collection.
Collect the Data¶
There are many data sources available, but the one I chose is Kaggle. After a thorough research of possible datasets that can be suitable for this project, I decided to rely on publicly available datasets from Kaggle.
Data Storage¶
Data storage is a critical consideration in data collection. If the datasets are large, cloud storage is a suitable option. For now, I am relying on local storage.
Extending or Limiting the Dataset¶
Data Fractionation¶
In this project, as mentioned, we are going to incorporate multiple datasets from different platforms with varied textual content. Furthermore, we are going to use sampling to get a subset of data for the analysis. For this project, we cannot work with millions of rows in a dataset, as that would require a lot of time for the models to train. Thus, we are going to take fractions of multiple datasets to create one dataset that is not redundant but still contains enough data for the predictions. If needed, bigger samples can be added.
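Drawing a balanced fraction from one source dataset can be sketched like this. The function name and the fixed random seed are assumptions for this example; the label values follow the Twitter dataset's 4/0 encoding described later.

```python
import pandas as pd

# Illustrative balanced sampling of one dataset fraction: an equal number
# of positive and negative rows. Function name and seed are assumptions.
def balanced_sample(df, label_col, pos_label, neg_label, n_each, seed=42):
    pos = df[df[label_col] == pos_label].sample(n=n_each, random_state=seed)
    neg = df[df[label_col] == neg_label].sample(n=n_each, random_state=seed)
    return pd.concat([pos, neg], ignore_index=True)
```

For instance, `balanced_sample(df, "Target", 4, 0, 15_000)` would produce the 30,000-row Twitter fraction with 15,000 rows per sentiment.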
Data Unification¶
To create a unified dataset, I will merge these fractional datasets using concatenation. This approach allows us to combine the individual datasets while ensuring data integrity and consistency. By concatenating the datasets, we create a cohesive dataset that maintains the unique characteristics of each source, without redundancy. This unified dataset will serve as the foundation for the sentiment analysis, ensuring that we have a diverse source of data for predictions.
In the process of merging the datasets, I recognize that while each source provides valuable text data, variation in format and content may exist. Therefore, I have implemented data preparation steps to ensure that the text data is compatible. By following standard steps, the data is going to be ready for concatenation.
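The unification step can be sketched as follows. The shared column names (`Text`, `Target`) are assumptions at this point in the notebook, standing in for whatever common schema the prepared fractions end up with.

```python
import pandas as pd

# Minimal sketch of the unification step: each prepared dataset is reduced
# to the same columns before concatenation. Column names are assumptions.
part_a = pd.DataFrame({"Text": ["great shoes"], "Target": ["Positive"]})
part_b = pd.DataFrame({"Text": ["poor quality"], "Target": ["Negative"]})
unified = pd.concat([part_a, part_b], ignore_index=True)
print(unified.shape)  # (2, 2)
```

Passing `ignore_index=True` rebuilds a clean 0..n-1 index so that rows from different sources do not carry clashing index values into the unified dataset.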
Scalability and Reproducibility¶
Recognizing the potential for dataset updates or extensions, I have designed the data preparation process to be scalable and reproducible. Rather than relying on manual intervention, the process follows standardized steps. If new data needs to be incorporated, the provisioning phase can be easily repeated by following the established data preparation pipeline.
By treating the data preparation as a systematic process, updates and extensions to the dataset can be seamlessly integrated without extensive manual work. This approach ensures that the dataset remains adaptable to evolving requirements.
Twitter dataset¶
The first dataset, that I decided to start with, is the Twitter Dataset. It is commonly known as sentiment140 dataset and it contains 1.6 million tweets that have been extracted using the Twitter API. The dataset contains the following fields:
- target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- ids: The id of the tweet ( 2087)
- date: the date of the tweet (Sat May 16 23:58:44 UTC 2009), providing a temporal dimension to the dataset.
- flag: The query (lyx). If there is no query, then this value is NO_QUERY.
- user: the user that tweeted (robotickilldozr)
- text: the text of the tweet (Lyx is cool)
The tweets in this dataset were collected in 2009 over a specific time period. It is important to note that this dataset is unique in that it was not manually annotated by humans. Instead, the creators applied an innovative approach based on emoticons within tweets: tweets containing positive emoticons were labeled as positive, while those with negative emoticons were labeled as negative. This method allowed for the automatic generation of a large training dataset, differentiating it from traditional manual annotation.
The official paper detailing the dataset's creation can be found in the citation: Go, A., Bhayani, R., and Huang, L., 2009. 'Twitter sentiment classification using distant supervision.' CS224N Project Report, Stanford, 1(2009), p.12.
At the start of this project, a subset of this dataset was selected, comprising 14,000 rows evenly split between 7,000 positive and 7,000 negative tweets, to serve as the basis for sentiment analysis. However, now that the project is coming to an end, I decided to use a somewhat bigger sample with a total of 30,000 tweets: 15,000 positive and 15,000 negative.
Amazon Dataset¶
The second dataset that I chose is filled with Amazon Reviews. The dataset contains 34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, from the Stanford Network Analysis Project (SNAP). The subset, provided on Kaggle, contains 1,800,000 training samples for each sentiment.
The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings and a plaintext review. The creator of the dataset is Xiang Zhang.
The Amazon reviews polarity is constructed by taking review score 1 as negative and 2 as positive. Each class has 1,800,000 training samples. The columns in the dataset are:
- polarity: 1 for negative and 2 for positive
- title: review heading
- text: review body
At the beginning of this project, a targeted sample of 14,000 rows was extracted from the extensive Amazon Review Polarity dataset. This subset was curated to include an equal distribution of positive and negative reviews, ensuring a balanced representation for sentiment analysis. However, as the project is coming to an end and we can experiment with more data to make the predictions even more accurate, a bigger sample was curated for this dataset as well. The sample now totals 30,000 rows: 15,000 positive and 15,000 negative.
Reddit dataset¶
The last dataset that I decided to include in my project is from Reddit. It consists of user comments with their sentiment labels. This dataset was curated as part of a university project that aimed to conduct sentiment analysis across multiple social media platforms. To gather the Reddit comments, the PRAW API was employed, ensuring a rich source of textual data for analysis. The content of these comments primarily revolves around discussions related to prominent leaders. The sentiment labels assigned to each comment are -1, 0, and 1, signifying negative, neutral, and positive sentiments, respectively. This dataset, comprising approximately 37,000 comments, offers a valuable source of information for sentiment analysis.
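Aligning the Reddit labels with the other datasets can be sketched with a simple mapping. The string label names are an assumption made for consistency with the other sources, not part of the original dataset.

```python
# Sketch of mapping the Reddit 'category' codes (-1, 0, 1) onto label
# names; the string names are an assumption made for consistency.
label_map = {-1: "Negative", 0: "Neutral", 1: "Positive"}

comment_labels = [1, 0, -1]
print([label_map[c] for c in comment_labels])  # ['Positive', 'Neutral', 'Negative']
```

In pandas, the same mapping could be applied in one step with `df3["category"].map(label_map)`.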
📃 Sample the data¶
After loading the data, we are going to sample the data, to get a first impression and have a look at the features. We are going to first examine the datasets separately.
# twitter dataset
df = pd.read_csv("longer_twitter.csv", encoding="latin-1")
# df = pd.read_csv("/kaggle/input/twitterdataset/longer_twitter.csv", encoding="latin-1")
column_names = ['Target', 'ID', 'Date', 'Flag', 'User', 'Text']
df.columns = column_names
df.sample(5)
| Target | ID | Date | Flag | User | Text | |
|---|---|---|---|---|---|---|
| 10291 | 4 | 1793572550 | Thu May 14 03:12:25 PDT 2009 | NO_QUERY | Happy_BaBies | excited for things that are coming my way... |
| 11269 | 0 | 1989193426 | Mon Jun 01 00:39:56 PDT 2009 | NO_QUERY | Sushilkrishnan | Monday Blues ! |
| 25909 | 0 | 1961871203 | Fri May 29 10:10:13 PDT 2009 | NO_QUERY | deekgeek | @jonhatesyou aw my dear i'm sorry |
| 21728 | 4 | 1834182529 | Mon May 18 01:56:57 PDT 2009 | NO_QUERY | joolzgirl | @artclubcaucasus thank you |
| 17923 | 0 | 2234819825 | Thu Jun 18 23:49:18 PDT 2009 | NO_QUERY | tinaaza | capee . |
# amazon dataset
df2 = pd.read_csv("longer_amazon.csv")
# df2 = pd.read_csv("/kaggle/input/amazondataset/longer_amazon.csv")
column_names2 = ['Target', 'Heading', 'Text']
df2.columns = column_names2
df2.sample(5)
| Target | Heading | Text | |
|---|---|---|---|
| 24803 | 2 | Clay Aiken's debut proves he will be around fo... | This album is an excellent "POP" album - EVERY... |
| 14434 | 1 | No hay caso, no entra | no se puede usar ninguna chaqueta de expansió... |
| 19775 | 2 | Ya Gotta Love It! | Dead Prez, Jilly from Philly, Erykah Badu,Mos ... |
| 27073 | 2 | great! | I just really loved this book.Such strong yet ... |
| 19848 | 2 | Is this book really for kids? | I found this book to be immensely entertaining... |
# reddit dataset
df3 = pd.read_csv("reddit.csv")
# df3 = pd.read_csv("/kaggle/input/redditdataset/reddit.csv")
df3.sample(5)
| clean_comment | category | |
|---|---|---|
| 26960 | kratos ragnarok | 0 |
| 23025 | everytime you get raise think about the increa... | 0 |
| 27925 | minute with republic and zee news all about t... | 1 |
| 2681 | maybe part the problem the very high rates fem... | 1 |
| 28432 | cpi maoist revolution when | 0 |
📊 Data Understanding¶
In this step, we are going to explore the dataset, make visualizations, and draw conclusions for our next steps. Data Understanding is a really important step where we can see trends in our datasets and decide which approach to take. I am going to explore each dataset separately. From the samples we saw, the Twitter dataset has the most features that we can explore; the other two datasets mostly consist of the actual content (text). We are going to explore all the features in each dataset and focus on the text part.
Exploring the Twitter Dataset ('df')¶
The exploration of the Twitter Dataset is going to start with the most common steps such as the data types of each column, the number of null values and the value counts in the Target column.
df.sample(5)
| Target | ID | Date | Flag | User | Text | |
|---|---|---|---|---|---|---|
| 11055 | 4 | 1687496722 | Sun May 03 08:51:09 PDT 2009 | NO_QUERY | cindysjourney | Spend 20 minutes every day in nature by yourse... |
| 16362 | 0 | 2263790964 | Sun Jun 21 01:49:19 PDT 2009 | NO_QUERY | tahn66 | @mis_diva shops shops shops. thanks god for co... |
| 27102 | 4 | 2176783170 | Mon Jun 15 04:44:39 PDT 2009 | NO_QUERY | Moniqueeve | is laying in myy big warm bed and being thankf... |
| 13713 | 0 | 1974986446 | Sat May 30 14:12:36 PDT 2009 | NO_QUERY | OwenGerrard | @BarryHarveyUK yeah I don't doubt it mate. Dis... |
| 29977 | 0 | 2003059763 | Tue Jun 02 05:47:36 PDT 2009 | NO_QUERY | petwebdesigner | @Candylatte Must be a U.S. thing |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 30000 entries, 0 to 29999 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Target 30000 non-null int64 1 ID 30000 non-null int64 2 Date 30000 non-null object 3 Flag 30000 non-null object 4 User 30000 non-null object 5 Text 30000 non-null object dtypes: int64(2), object(4) memory usage: 1.4+ MB
df.isnull().sum()
Target 0 ID 0 Date 0 Flag 0 User 0 Text 0 dtype: int64
twitter_counts = df['Target'].value_counts()
twitter_counts
Target 4 15000 0 15000 Name: count, dtype: int64
For this dataset, we have an equal number of positive and negative tweets, as I sampled the dataset myself and deliberately balanced the classes. However, the important thing to notice is that the labels do not provide any neutral values. For the goal of this project, we also need neutral labels.
df['Text_Length'] = df['Text'].apply(len)
print(df['Text_Length'].describe())
count 30000.000000 mean 74.197333 std 36.709671 min 7.000000 25% 44.000000 50% 69.000000 75% 104.000000 max 514.000000 Name: Text_Length, dtype: float64
Let's break down the results. We have 30,000 text entries with non-null values in the 'Text_Length' column. The average length is approximately 74 characters. The standard deviation is around 36.7. A higher standard deviation suggests more variability. The smallest text length is 7 characters. 25% of the text length is less than or equal to 44 characters. The median text length is 69 characters. 75% of the text lengths are less than or equal to 104 characters. The longest text in the dataset has 514 characters.
sns.histplot(df['Text_Length'], bins=30)
plt.title('Distribution of Tweet Lengths')
plt.show()
We can see that the tweet length is mostly between 10 and 150 characters. There are some longer entries, but their number is insignificant.
Now, we are going to see how the text length varies per sentiment. In the Twitter dataset, we only have 2 targets - positive and negative. The text lengths for both sentiments fall in the same range, so for the Twitter dataset, length does not appear to affect the sentiment of the text.
sentiments = df['Target'].unique()
fig, (ax1, ax2) = plt.subplots(nrows=len(sentiments), sharex=True, figsize=(7, 5))
for i, sentiment in enumerate(sentiments):
subset = df[df['Target'] == sentiment]
color = 'red' if sentiment == 0 else 'green'
ax = ax1 if sentiment == 0 else ax2
ax.scatter(subset['Text_Length'], subset.index, color=color, label=f'Sentiment {sentiment}', alpha=0.5)
ax.set_title(f'Text Length vs. Sentiment (Sentiment {sentiment})')
ax.set_xlabel('Text Length')
ax.set_ylabel('Index')
ax.legend()
plt.tight_layout()
plt.show()
Next, we are going to explore the most common words and the average number of words per entry. Since we have not done any cleaning or preprocessing steps yet, the most common words are stopwords - words like 'the', 'and', and 'a' that carry no real signal for the sentiment analysis task. We are going to look at the most common words again after we have prepared the data.
all_words = ' '.join(df['Text']).lower().split()
word_counts = Counter(all_words)
common_words = word_counts.most_common(10)
print("Top 10 common words:", common_words)
Top 10 common words: [('i', 14109), ('to', 10464), ('the', 9508), ('a', 6985), ('my', 6006), ('and', 5604), ('you', 4481), ('is', 4329), ('for', 4032), ('in', 3937)]
Now, we are going to see the most common words related to each sentiment by creating a function. This function returns the top 10 most common words given a dataframe and a sentiment. Since the Amazon and Twitter datasets have the same columns, we are going to reuse the same function later for Amazon, and create a new one for Reddit.
def get_common_words_twitter_amazon(dataframe, sentiment):
subset = dataframe[dataframe['Target'] == sentiment]
all_words = ' '.join(subset['Text'][subset['Text'].notnull()]).lower().split()
word_counts = Counter(all_words)
common_words = word_counts.most_common(10)
return common_words
common_words_positive_twitter = get_common_words_twitter_amazon(df, 4)
common_words_negative_twitter = get_common_words_twitter_amazon(df, 0)
print("Top 10 common words for Positive Sentiment:", common_words_positive_twitter)
print("Top 10 common words for Negative Sentiment:", common_words_negative_twitter)
Top 10 common words for Positive Sentiment: [('i', 5372), ('the', 4903), ('to', 4719), ('a', 3582), ('you', 2881), ('and', 2827), ('my', 2385), ('for', 2222), ('is', 2001), ('in', 1852)]
Top 10 common words for Negative Sentiment: [('i', 8737), ('to', 5745), ('the', 4605), ('my', 3621), ('a', 3403), ('and', 2777), ('is', 2328), ('in', 2085), ('it', 1929), ('for', 1810)]
The results are not valuable, since we have not removed the stopwords yet.
df['Num_Words'] = df['Text'].apply(lambda x: len(str(x).split()))
print("Average number of words per tweet:", df['Num_Words'].mean())
Average number of words per tweet: 13.174933333333334
common_word_lengths = {word: len(word) for word, _ in common_words}
print("Length of words used often:", common_word_lengths)
Length of words used often: {'i': 1, 'to': 2, 'the': 3, 'a': 1, 'my': 2, 'and': 3, 'you': 3, 'is': 2, 'for': 3, 'in': 2}
In the next code, we are going to see a visualization with the most common words related to Positive and Negative sentiments.
positive_text = df[df['Target'] == 4]['Text']
negative_text = df[df['Target'] == 0]['Text']
def generate_and_plot_wordcloud(text, title, ax):
wordcloud = WordCloud(width=400, height=200, random_state=21, max_font_size=119, background_color='white').generate(text)
ax.imshow(wordcloud, interpolation="bilinear")
ax.axis('off')
ax.set_title(title, fontsize=12)
fig, axs = plt.subplots(1, 2, figsize=(15, 5))
generate_and_plot_wordcloud(' '.join(positive_text), 'Positive Sentiment', axs[0])
generate_and_plot_wordcloud(' '.join(negative_text), 'Negative Sentiment', axs[1])
plt.suptitle('Word Clouds for Different Sentiments in Twitter Dataset', fontsize=16)
plt.tight_layout()
plt.show()
In our dataset, as of now, we have 30,000 entries, which include the usernames of the users who tweeted. We can look at the top 10 users, but the number of tweets per user is small: 'tweetpet', the most active user, has only 10 tweets in total. With that being said, we cannot draw any significant conclusions relating users to either positive or negative sentiments.
top_users = df['User'].value_counts().nlargest(10)
plt.figure(figsize=(10, 6))
top_users.plot(kind='barh', color='skyblue')
plt.title('Top 10 Users Contributing to the Dataset')
plt.xlabel('Number of Tweets')
plt.ylabel('User')
plt.show()
The next feature that we are going to explore is the date. As we already know, the Twitter dataset covers a period in 2009, roughly from April to June. Within this period, most tweets fall between the end of May and the end of June.
df['Date'] = pd.to_datetime(df['Date'], format='%a %b %d %H:%M:%S PDT %Y')
plt.figure(figsize=(12, 6))
df['Date'].dt.date.value_counts().sort_index().plot()
plt.title('Distribution of Tweets Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Tweets')
plt.show()
Exploring the Amazon Dataset ('df2')¶
Now, we are continuing with the exploration of the Amazon dataset. We are going to follow the same logic and methodology as the Twitter dataset.
df2.sample(5)
| | Target | Heading | Text |
|---|---|---|---|
| 22252 | 2 | Their best before "Rubber Soul" | Before I got this CD, I assumed from the revie... |
| 9101 | 1 | Marvel-ously dissappointing | The film is AMAZING and well worth the wait bu... |
| 13206 | 1 | One of the Worst | If you weren't melanchy when you rent this tit... |
| 4064 | 2 | Better than a poke in the eye with a sharp sti... | Nuances of dio, megadeth, alice in chains, ban... |
| 23570 | 2 | Points for originality | While I definitely can't say this was in the t... |
df2.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 30000 entries, 0 to 29999 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Target 30000 non-null int64 1 Heading 29999 non-null object 2 Text 30000 non-null object dtypes: int64(1), object(2) memory usage: 703.3+ KB
df2.isnull().sum()
Target 0 Heading 1 Text 0 dtype: int64
amazon_counts = df2['Target'].value_counts()
amazon_counts
Target 2 15000 1 15000 Name: count, dtype: int64
Here, once again, the positive and negative sentiments are equally distributed because of the sampling. The important thing is that we are missing a Neutral class.
df2['Text_Length'] = df2['Text'].apply(len)
print(df2['Text_Length'].describe())
count 30000.000000 mean 407.029700 std 235.807764 min 39.000000 25% 206.000000 50% 358.000000 75% 570.000000 max 1008.000000 Name: Text_Length, dtype: float64
Let's break down the results. We have 30,000 entries with non-null values in the 'Text_Length' column. The average length is 407 characters, which is significantly longer than in the Twitter dataset. The standard deviation is 235.8, which indicates a considerable amount of variability in the text lengths. The smallest text length is 39 characters. 25% of the text lengths are less than or equal to 206 characters. The median text length is 358 characters. 75% of the text lengths are less than or equal to 570 characters. The longest text in the dataset is 1008 characters.
The results are quite different from the Twitter dataset: the Amazon reviews are a lot longer than the tweets.
sns.histplot(df2['Text_Length'], bins=30)
plt.title('Distribution of Review Lengths')
plt.show()
Now, we are going to see whether the text length shows a pattern across the sentiments. For the Amazon dataset, text length does not show any significant trend: both sentiments - positive and negative - have a similar length distribution.
sentiments = df2['Target'].unique()
fig, (ax1, ax2) = plt.subplots(nrows=len(sentiments), sharex=True, figsize=(7, 5))
for i, sentiment in enumerate(sentiments):
subset = df2[df2['Target'] == sentiment]
color = 'red' if sentiment == 1 else 'green'
ax = ax1 if sentiment == 1 else ax2
ax.scatter(subset['Text_Length'], subset.index, color=color, label=f'Sentiment {sentiment}', alpha=0.5)
ax.set_title(f'Text Length vs. Sentiment (Sentiment {sentiment})')
ax.set_xlabel('Text Length')
ax.set_ylabel('Index')
ax.legend()
plt.tight_layout()
plt.show()
On the histogram, we can see that text lengths vary between roughly 20 and 1000 characters. Most entries are towards the lower end, but there are also many longer texts.
all_words_amazon = ' '.join(df2['Text']).lower().split()
word_counts_amazon = Counter(all_words_amazon)
common_words_amazon = word_counts_amazon.most_common(10)
print("Top 10 common words:", common_words_amazon)
Top 10 common words: [('the', 113564), ('and', 61611), ('i', 58184), ('a', 56087), ('to', 55881), ('of', 44963), ('this', 40418), ('is', 39684), ('it', 38975), ('in', 26149)]
Now, we are going to see the most common words per sentiment for this dataset. We are going to be reusing the function that we have already created for the Twitter Dataset.
common_words_positive_amazon = get_common_words_twitter_amazon(df2, 2)
common_words_negative_amazon = get_common_words_twitter_amazon(df2, 1)
print("Top 10 common words for Positive Sentiment:", common_words_positive_amazon)
print("Top 10 common words for Negative Sentiment:", common_words_negative_amazon)
Top 10 common words for Positive Sentiment: [('the', 53437), ('and', 32698), ('a', 28409), ('to', 26570), ('i', 26509), ('of', 22759), ('is', 20801), ('this', 19482), ('it', 17878), ('in', 13313)]
Top 10 common words for Negative Sentiment: [('the', 60127), ('i', 31675), ('to', 29311), ('and', 28913), ('a', 27678), ('of', 22204), ('it', 21097), ('this', 20936), ('is', 18883), ('in', 12836)]
Once again, the most common words are the stopwords.
df2['Num_Words'] = df2['Text'].apply(lambda x: len(str(x).split()))
print("Average number of words per review:", df2['Num_Words'].mean())
Average number of words per review: 74.43476666666666
common_word_lengths_amazon = {word: len(word) for word, _ in common_words_amazon}
print("Length of words used often:", common_word_lengths_amazon)
Length of words used often: {'the': 3, 'and': 3, 'i': 1, 'a': 1, 'to': 2, 'of': 2, 'this': 4, 'is': 2, 'it': 2, 'in': 2}
Once again, the most common words are the stopwords. The average number of words per review is 74.
In the next plot, we are going to see the words related to Positive and Negative Sentiments.
positive_text = df2[df2['Target'] == 2]['Text']
negative_text = df2[df2['Target'] == 1]['Text']
# reusing the generate_and_plot_wordcloud helper defined for the Twitter dataset
fig, axs = plt.subplots(1, 2, figsize=(15, 5))
generate_and_plot_wordcloud(' '.join(positive_text), 'Positive Sentiment', axs[0])
generate_and_plot_wordcloud(' '.join(negative_text), 'Negative Sentiment', axs[1])
plt.suptitle('Word Clouds for Different Sentiments in Amazon Dataset', fontsize=16)
plt.tight_layout()
plt.show()
Exploring the Reddit Dataset ('df3')¶
The last dataset that we are going to explore is the Reddit Dataset. We are going to follow the same approach.
df3.sample(5)
| | clean_comment | category |
|---|---|---|
| 23732 | corlick wearing skirt that too short for her d... | 0 |
| 22650 | india really bad actually enforcing laws the e... | 1 |
| 22005 | kumaraswamy verdict will fast tracked and del... | 1 |
| 8762 | tried bring land reforms allow corporates stea... | -1 |
| 8758 | this sub reddit certified modi fan group doesn... | 1 |
df3.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 37249 entries, 0 to 37248 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 clean_comment 37149 non-null object 1 category 37249 non-null int64 dtypes: int64(1), object(1) memory usage: 582.1+ KB
df3.isnull().sum()
clean_comment 100 category 0 dtype: int64
reddit_counts = df3['category'].value_counts()
reddit_counts
category 1 15830 0 13142 -1 8277 Name: count, dtype: int64
The first thing to mention is that we have 100 missing values in the 'clean_comment' column. We are going to address this issue further in the Data Preparation step. As for now, in order to continue exploring the dataset, we are going to filter the null values to not get an error.
The second thing to notice is that here we have all 3 classes available - Positive, Negative and Neutral.
df3['Text_Length'] = df3['clean_comment'].apply(lambda x: len(str(x)) if pd.notnull(x) else 0)
print(df3['Text_Length'].describe())
count 37249.000000 mean 180.901742 std 358.096484 min 0.000000 25% 38.000000 50% 80.000000 75% 183.000000 max 8665.000000 Name: Text_Length, dtype: float64
One last time, let's break down the results. There are 37,249 entries with a value in the 'Text_Length' column (missing comments were assigned a length of 0 to avoid errors). The average length of the text is approximately 180.9 characters. The standard deviation is around 358.1 and indicates a wide range of text lengths. The smallest text length is 0, because of the missing values. 25% of the text lengths are less than or equal to 38 characters. The median text length is 80 characters. 75% of the text lengths are less than or equal to 183 characters. The longest text is 8665 characters, which is the longest entry out of all 3 datasets.
sns.histplot(df3['Text_Length'], bins=30)
plt.title('Distribution of Comment Lengths')
plt.show()
We can see that for this dataset, the numbers on the axis are quite different and significantly higher. Most text lengths fall between 0 and 1000 characters, with fewer entries extending towards 2000 and a handful of much longer outliers.
Now, we are going to see how text length relates to sentiment in the Reddit dataset. Unlike in the previous datasets, there is an actual trend here. We have already seen that Reddit comments are quite long compared to Twitter and Amazon, yet the shortest comments are the neutral ones. A likely reason is that neutral statements tend to be short: people write longer texts to express emotions, and when they are neutral about something, they simply have less to say. This trend is visible here mainly because the Reddit dataset has 3 sentiment classes. Looking at the scatterplots, the text lengths for positive and negative sentiments once again do not differ much; positive texts can be somewhat longer than negative ones, but the variation is not significant enough to support any major conclusions.
sentiments = df3['category'].unique()
fig, (ax1, ax2, ax3) = plt.subplots(nrows=len(sentiments), sharex=True, figsize=(7, 5))
for i, sentiment in enumerate(sentiments):
subset = df3[df3['category'] == sentiment]
color = 'red' if sentiment == -1 else ('green' if sentiment == 1 else 'blue')
ax = ax1 if sentiment == -1 else (ax2 if sentiment == 0 else ax3)
ax.scatter(subset['Text_Length'], subset.index, color=color, label=f'Sentiment {sentiment}', alpha=0.5)
ax.set_title(f'Text Length vs. Sentiment (Sentiment {sentiment})')
ax.set_xlabel('Text Length')
ax.set_ylabel('Index')
ax.legend()
plt.tight_layout()
plt.show()
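The trend described above can also be quantified numerically. The sketch below uses a toy stand-in for `df3` (the values are invented for illustration, not taken from the dataset); the same `groupby` call on the real frame would give the median length per sentiment class:

```python
import pandas as pd

# Toy stand-in for df3: three sentiment classes with made-up text lengths.
toy = pd.DataFrame({
    "category":    [1, 1, 0, 0, -1, -1],
    "Text_Length": [120, 300, 20, 35, 90, 150],
})

# Median text length per sentiment class (-1 negative, 0 neutral, 1 positive)
median_by_sentiment = toy.groupby("category")["Text_Length"].median()
print(median_by_sentiment)
```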
all_words_reddit = ' '.join(df3['clean_comment'][df3['clean_comment'].notnull()]).lower().split()
word_counts_reddit = Counter(all_words_reddit)
common_words_reddit = word_counts_reddit.most_common(10)
print("Top 10 common words:", common_words_reddit)
Top 10 common words: [('the', 57713), ('and', 28958), ('that', 15381), ('this', 13488), ('for', 12987), ('you', 11736), ('are', 10568), ('not', 8700), ('they', 8658), ('have', 8385)]
Now, we are going to see again the most common words for each sentiment category. First we are going to create a new function that gives us the common words based on the sentiment and the dataframe.
def get_common_words_reddit(dataframe, sentiment):
subset = dataframe[dataframe['category'] == sentiment]
all_words = ' '.join(subset['clean_comment'][subset['clean_comment'].notnull()]).lower().split()
word_counts = Counter(all_words)
common_words = word_counts.most_common(10)
return common_words
common_words_positive = get_common_words_reddit(df3, 1)
common_words_neutral = get_common_words_reddit(df3, 0)
common_words_negative = get_common_words_reddit(df3, -1)
print("Top 10 common words for Positive Sentiment:", common_words_positive)
print("Top 10 common words for Neutral Sentiment:", common_words_neutral)
print("Top 10 common words for Negative Sentiment:", common_words_negative)
Top 10 common words for Positive Sentiment: [('the', 37445), ('and', 19205), ('that', 9960), ('for', 8325), ('this', 8035), ('you', 7117), ('are', 6496), ('have', 5470), ('they', 5345), ('not', 5312)]
Top 10 common words for Neutral Sentiment: [('the', 5730), ('and', 2016), ('this', 1624), ('you', 1571), ('for', 1350), ('that', 1297), ('are', 1091), ('same', 1031), ('not', 863), ('modi', 847)]
Top 10 common words for Negative Sentiment: [('the', 14538), ('and', 7737), ('that', 4124), ('this', 3829), ('for', 3312), ('you', 3048), ('are', 2981), ('they', 2551), ('not', 2525), ('have', 2202)]
As we still have the stopwords, the results are not very distinctive or groundbreaking, so we are going to repeat this analysis after cleaning the text.
df3['Num_Words'] = df3['clean_comment'].apply(lambda x: len(str(x).split()))
print("Average number of words per comment:", df3['Num_Words'].mean())
Average number of words per comment: 29.327579263872856
common_word_lengths_reddit = {word: len(word) for word, _ in common_words_reddit}
print("Length of words used often:", common_word_lengths_reddit)
Length of words used often: {'the': 3, 'and': 3, 'that': 4, 'this': 4, 'for': 3, 'you': 3, 'are': 3, 'not': 3, 'they': 4, 'have': 4}
Again, the most common words are the stopwords. The average number of words per comment is 29.3.
In the next plot, we will see the most common words related to all 3 categories. The words in this dataset may be a little bit strange, since the dataset is gathered from a discussion of a political situation. However, this dataset provides us with valuable information and comprehensive text entries.
positive_text = df3[df3['category'] == 1]['clean_comment']
negative_text = df3[df3['category'] == -1]['clean_comment']
neutral_text = df3[df3['category'] == 0]['clean_comment']
# reusing the generate_and_plot_wordcloud helper defined for the Twitter dataset
positive_text = positive_text.apply(lambda x: str(x) if not isinstance(x, str) else x)
negative_text = negative_text.apply(lambda x: str(x) if not isinstance(x, str) else x)
neutral_text = neutral_text.apply(lambda x: str(x) if not isinstance(x, str) else x)
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
generate_and_plot_wordcloud(' '.join(positive_text), 'Positive Sentiment', axs[0])
generate_and_plot_wordcloud(' '.join(negative_text), 'Negative Sentiment', axs[1])
generate_and_plot_wordcloud(' '.join(neutral_text), 'Neutral Sentiment', axs[2])
plt.suptitle('Word Clouds for Different Sentiments in Reddit Dataset', fontsize=16)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
In conclusion, the datasets show different trends and vary considerably from one another. The Twitter dataset has the shortest text entries, whereas the Reddit dataset has the longest. Each dataset contributes different words related to positive and negative sentiment, which allows us to build a diverse combined dataset containing text from different sources.
🛠️ Data Preparation¶
This is the last step of Phase 2 - Provisioning. We are going to prepare the data for the modelling. After we have gathered some insights on the data, we can make suitable conclusions and take some action towards fixing the dataset.
Firstly, we are going to remove some columns from 2 of the datasets that we do not need for our task. However, we are going to keep the columns that we made, indicating the Text Length and the Number of Words.
I decided to drop the Target/category columns. For this project, we need Positive, Negative, and Neutral classes. Since two of the datasets do not provide a neutral class, we are going to get the polarity of the text using common libraries that are trustworthy and consistent.
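Since two of the sources only carry positive/negative labels, three-class labels can be derived from polarity scores with a simple threshold rule. A minimal sketch of that mapping, assuming a polarity score in [-1, 1] (the range returned by libraries such as TextBlob or VADER); the 0.05 cut-off is an illustrative assumption, not a value from this notebook:

```python
def polarity_to_label(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to one of three sentiment classes.

    The threshold is a hypothetical choice: scores close to zero are
    treated as neutral.
    """
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"

print(polarity_to_label(0.8))    # -> Positive
print(polarity_to_label(0.01))   # -> Neutral
print(polarity_to_label(-0.4))   # -> Negative
```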
# twitter dataset
text_df = df.drop(['ID', 'Date', 'Flag', 'User', 'Target'], axis=1)
text_df.sample(5)
| | Text | Text_Length | Num_Words |
|---|---|---|---|
| 20656 | I'm featured today on www.askbabykid.com today... | 65 | 9 |
| 8022 | @replytommcfly hiya tom .. im sophie, from tha... | 117 | 21 |
| 20870 | feeling a little lonely on a friday night in p... | 56 | 10 |
| 1745 | TweetDock doesn't let me to send any new tweet... | 130 | 23 |
| 26535 | @guybatty my pleasure -here you are! anytime lol | 49 | 8 |
# amazon dataset
text_df2 = df2.drop(['Heading', 'Target'], axis=1)
text_df2.sample(5)
| | Text | Text_Length | Num_Words |
|---|---|---|---|
| 28432 | The style is highly suspicious and can be desc... | 701 | 123 |
| 8477 | Old ideas of a "PC centric" world rewritten to... | 313 | 55 |
| 11785 | I was first exposed to Extreme Bop-It at a ret... | 273 | 54 |
| 11375 | THis game is so boring! You walk around trying... | 227 | 44 |
| 15504 | I purchased this for my mom as a christmas gif... | 202 | 41 |
For the Reddit dataset, we have a little bit more work, we need to drop the column 'category', but also rename the column 'clean_comment' to 'Text', so it is the same as the previous datasets. This makes the process easier. Furthermore, we need to delete the rows that had missing values, as we saw when we explored the dataset for missing values.
Before we get to this work, let's see the rows that contain the missing values.
null_rows = df3[df3['clean_comment'].isnull()]
print(null_rows)
clean_comment category Text_Length Num_Words 413 NaN 0 0 1 605 NaN 0 0 1 2422 NaN 0 0 1 2877 NaN 0 0 1 3307 NaN 0 0 1 ... ... ... ... ... 35975 NaN 0 0 1 36036 NaN 0 0 1 37043 NaN 0 0 1 37111 NaN 0 0 1 37238 NaN 0 0 1 [100 rows x 4 columns]
As we can see, the missing data is in the clean_comment column, which is the column we need for our analysis. This data appears to be Missing Completely at Random (MCAR). MCAR means that the absence of the data does not depend on any observed or unobserved variables in the dataset. In our case, we do not need these missing values and there is no way to obtain them. Since the dataset only has 2 columns, it is reasonable to assume that the missing values are randomly distributed. Given the context, it is unlikely that the missing data is Missing at Random (MAR) or Missing Not at Random (MNAR). If it were MAR, the probability of missing data would depend on the values of other variables in the dataset, which is not the case here. MNAR implies that the missing values are related to some unobserved factors, making it hard to ignore them. To conclude, in this scenario, it is appropriate to treat these missing values as MCAR, and since they represent a very small portion of the dataset, we can proceed without them.
# reddit dataset
# renaming the column
df3 = df3.rename(columns={'clean_comment': 'Text'})
# dropping the category column and then dropping the missing values
text_df3 = df3.drop(['category'], axis=1)
text_df3 = text_df3.dropna(subset=['Text'])
text_df3.sample(5)
| | Text | Text_Length | Num_Words |
|---|---|---|---|
| 1266 | local paper reports 122 object captured satel... | 68 | 10 |
| 34708 | fun when raga lose both amethi and wayanadu | 44 | 8 |
| 988 | 100th comment this day stickied post | 36 | 6 |
| 11526 | good the number times supported other dictator... | 104 | 16 |
| 11083 | another day another disappointment | 35 | 4 |
Now, the three datasets have the same columns and are ready to be merged. I am going to use a function from pandas to concatenate the datasets. Merging the datasets at this stage is a good idea for several reasons:
- Consistency in Columns: By already ensuring that all three datasets have the same columns, we create consistency and facilitate a seamless merging process.
- Holistic Analysis: Merging the datasets at this stage, allows us to perform a more comprehensive analysis on a larger and more diverse dataset. This can eventually lead to better insights and findings.
- Avoiding Redundancy in Cleaning: Performing data cleaning tasks on a unified dataset eliminates the redundancy. Instead of repeating the cleaning process for each individual dataset, we apply it to the merged dataset.
- Efficiency in Data Processing: Combining datasets early can improve computational efficiency. It reduces the need to repeat similar operations on multiple datasets.
combined_df = pd.concat([text_df, text_df2, text_df3], ignore_index=True)
For the next part of the data preparation process, we are going to apply cleaning and modification steps to achieve good quality and consistency in the dataset. It is important to note that these steps might differ for different data sources; the code provided here is tailored to the Twitter, Amazon and Reddit datasets.
The data cleaning steps that I decided to take on:
- General cleaning: Converting to lowercase, removing symbols, numbers, etc
- Removing stopwords: Removing stopwords such as 'the', 'and'
- Expanding common abbreviations: 'lol' becomes 'laugh out loud'
- Correcting misspelled words
- Filtering the text to exclude non-english words
The data cleaning steps that I decided not to take on:
- Expanding contractions: 'don't' to become 'do not'. In my opinion, this is going to strongly impact the modeling and is not needed in our case.
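Two of the planned steps (correcting misspellings and filtering out non-English words) depend on a reference vocabulary. A minimal sketch of the filtering idea, using a tiny stand-in vocabulary (in practice the vocabulary could come from something like `nltk.corpus.words`; that source, and the words below, are assumptions for illustration):

```python
# Tiny stand-in vocabulary; the real one would be far larger.
english_vocab = {"love", "this", "product", "great", "bad"}

def keep_english_words(text, vocab):
    """Drop tokens that are not in the reference vocabulary."""
    return " ".join(word for word in text.split() if word in vocab)

print(keep_english_words("love this producto xyz", english_vocab))  # -> love this
```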
General Cleaning¶
Our next step is to create a function that cleans the text. In this function, we are going to use regular expressions. In order to have a good foundation for the modelling, we need to achieve good quality in the Text column. The cleaning steps we are going to take in this function:
- Converting to lowercase: This is a common step to ensure consistency in the text. For sentiment analysis, case sensitivity is not relevant in most of the situations, so converting to lowercase helps in standardizing the text.
- Removing Mentions (@username): We are removing the mentions of other people. Mentions of usernames do not convey sentiment.
- Removing Hashtags: Hashtags often contain topic-related keywords, but do not contribute directly to sentiment classification. Removing them simplifies the text and reduces noise.
- Removing Retweet Tags (RT): Retweets commonly start with the 'RT' tag, so we remove it. These tags are not indicative of sentiment, so removing them helps focus on the actual content.
- Removing URLs (Hyperlinks): Removing hyperlinks starting with 'http://' or 'https://'. Hyperlinks do not usually contribute to sentiment analysis and may lead to misleading results if included.
- Removing HTML entities: We are removing symbols such as & from the text.
- Removing Non-Alphanumeric Characters and Digits: Removing characters that are not letters, as well as digits. These elements do not provide valuable information for sentiment analysis and can be considered noise.
- Removing Repeated Characters: Collapsing consecutive repeated characters, for example: 'loooooove' becomes 'love'. This helps in standardizing the text and focusing on the core sentiment-carrying words.
def cleanTxt(text):
    text = re.sub(r'RT[\s]+', '', text)         # remove retweet tags BEFORE lowercasing, so 'RT' can still match
    text = text.lower()
    text = re.sub(r'@[A-Za-z0-9]+', '', text)   # mentions
    text = re.sub(r'#', '', text)               # hashtag symbols
    text = re.sub(r'https?:\/\/\S+', '', text)  # URLs
    text = re.sub(r'&\w+;', '', text)           # HTML entities such as &amp;
    text = re.sub(r'[^a-z\s]', '', text)        # non-letter characters, including digits
    text = re.sub(r'(.)\1{2,}', r'\1', text)    # collapse repeated characters: 'loooooove' -> 'love'
    return text
combined_df['Text'] = combined_df['Text'].apply(cleanTxt)
combined_df.sample(5)
| Text | Text_Length | Num_Words | |
|---|---|---|---|
| 21623 | now that their is a more local airport its nic... | 80 | 17 |
| 76579 | did those same people pointed out his christia... | 113 | 19 |
| 14699 | followfriday some tweeps whove chatted this w... | 131 | 14 |
| 47078 | there may be a book here but what i read isnt ... | 802 | 136 |
| 6845 | anyone on im all alone | 25 | 5 |
Removing Stopwords¶
After cleaning the text, we are going to remove common English stopwords like "the" and "and", using the stopword list provided by the NLTK library. For each text in the "Text" column, we split the text into words and keep only those words that are not in the list of stopwords.
stop_words=stopwords.words('english')
combined_df['Text'] = combined_df['Text'].apply(lambda txt: ' '.join([word for word in txt.split() if word not in stop_words]))
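As a quick sanity check, the filter behaves like this on a toy sentence (using a small stand-in set here; the notebook uses NLTK's full English stopword list):

```python
# Stand-in for stopwords.words('english'); the filtering logic is identical.
stop_words_demo = {"the", "and", "a", "is", "to", "my"}

def remove_stopwords(text, stop_words):
    return " ".join(word for word in text.split() if word not in stop_words)

print(remove_stopwords("the shoes fit and the color is great", stop_words_demo))
# -> shoes fit color great
```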
Expanding Common Abbreviations¶
Afterwards, we are going to use a custom dictionary of abbreviations and their meanings to expand the abbreviations in the text data. I made this dictionary myself; it consists of the most popular abbreviations found on social media.
# Duplicate keys from the original list have been removed: in a Python dict,
# a repeated key silently overrides the earlier entry. Note that the uppercase
# keys will not match the already-lowercased text.
abbreviations_dictionary = {
    "lol": "laugh out loud",
    "omg": "oh my god",
    "brb": "be right back",
    "btw": "by the way",
    "imho": "in my humble opinion",
    "fyi": "for your information",
    "afaik": "as far as I know",
    "irl": "in real life",
    "tl;dr": "too long; didn't read",
    "smh": "shaking my head",
    "op": "original poster",
    "dm": "direct message",
    "tmi": "too much information",
    "tbh": "to be honest",
    "iirc": "if I remember correctly",
    "ama": "ask me anything",
    "fomo": "fear of missing out",
    "imo": "in my opinion",
    "np": "no problem",
    "oc": "original content",
    "srsly": "seriously",
    "yolo": "you only live once",
    "ootd": "outfit of the day",
    "idk": "I don't know",
    "nsfw": "not safe for work",
    "iktr": "I know that's right",
    "ikyl": "I know you're lying",
    "rofl": "rolling on the floor laughing",
    "xoxo": "hugs and kisses",
    "ikr": "I know, right?",
    "ofc": "of course",
    "tfw": "the feel when",
    "tfti": "thanks for the invite",
    "b/w": "between",
    "dl": "down-low",
    "fr": "for real",
    "qt": "cutie",
    "w/e": "whatever",
    "w/o": "without",
    "ngl": "not gonna lie",
    "obv": "obviously",
    "grl": "girl",
    "ASAP": "as soon as possible",
    "RSVP": "répondez s'il vous plaît",
    "FAQs": "frequently asked questions",
    "TGIF": "thank god it's Friday",
    "IMO": "in my opinion",
    "IMHO": "in my humble opinion",
    "N/A": "not available",
    "DIY": "do-it-yourself",
    "FYI": "for your information",
    "AKA": "also known as",
    "FKA": "formerly known as",
    "BYOB": "bring your own beverage",
    "BO": "body odor",
    "ETA": "estimated time of arrival",
    "Q&A": "questions and answers",
    "ID": "identification",
    "RIP": "rest in peace",
    "VIP": "very important person",
    "i.e.": "in other words",
    "e.g.": "for example",
    "PIN": "personal identification number",
    "SOS": "save our ship (help)",
    "SO": "significant other",
    "TMI": "too much information",
    "POV": "point of view",
    "HBD": "happy birthday",
    "lmk": "let me know",
    "nvm": "nevermind",
    "omw": "on my way",
    "thx": "thanks",
    "ty": "thank you",
    "lmao": "laughing my ass off",
    "wtf": "what the ****",
    "wth": "what the hell",
    "iykyk": "if you know, you know",
    "sthu": "shut the hell up",
    "TL;DR": "too long; didn't read",
    "2day": "today",
    "2moro": "tomorrow",
    "atm": "at the moment",
    "b4": "before",
    "l8r": "later",
    "cu": "see you",
    "cya": "see ya",
    "gr8": "great",
    "ily": "I love you",
    "ily2": "I love you too",
    "pls": "please",
    "r u srs": "are you serious?",
    "y": "why?",
    "ttyl": "talk to you later",
    "bc": "because",
    "DM": "direct message",
    "ftw": "for the win",
    "jk": "just kidding",
    "nbd": "no big deal",
    "rn": "right now",
    "DAE": "does anyone else?",
    "hmu": "hit me up",
    "wyd": "what (are) you doing?",
    "idc": "I don't care",
    "h8": "hate",
    "pic": "picture"
}
def get_expansion_from_dictionary(abbreviation):
    # fall back to the original token when no expansion is known
    return abbreviations_dictionary.get(abbreviation, abbreviation)
print("\nModified Text:")
modifications_printed = 0
for i, original_text in enumerate(combined_df['Text']):
modified_text = ' '.join([get_expansion_from_dictionary(word) for word in original_text.split()])
if original_text != modified_text:
print(f"{i+1}. Original: {original_text}\n Modified: {modified_text}")
modifications_printed += 1
if modifications_printed == 10:
break
Modified Text:
2. Original: chilling emily lol nthen tomorrow run mile p
   Modified: chilling emily laugh out loud nthen tomorrow run mile p
12. Original: got email fake facebook requesting find friends mei none lol
   Modified: got email fake facebook requesting find friends mei none laugh out loud
46. Original: irritates sched appt client wk cancel min news made nd starbucks stop today lol
   Modified: irritates sched appt client wk cancel min news made nd starbucks stop today laugh out loud
75. Original: picture lol ducks pretty cool
   Modified: picture laugh out loud ducks pretty cool
83. Original: sunburnt badly actully feels shit yet im sick tbh
   Modified: sunburnt badly actully feels shit yet im sick to be honest
91. Original: yes would follow would cool xoxo
   Modified: yes would follow would cool hugs and kisses
135. Original: lol cut bangs like kate hudson bride wars months ago super cute looks diff time
   Modified: laugh out loud cut bangs like kate hudson bride wars months ago super cute looks diff time
137. Original: think mention follow thats like days lol coffee
   Modified: think mention follow thats like days laugh out loud coffee
138. Original: well move know going yay lol
   Modified: well move know going yay laugh out loud
190. Original: night museum tonite instead oh well yr old better enjoy lol
   Modified: night museum tonite instead oh well yr old better enjoy laugh out loud
To show the results, I decided to print the first 10 rows that were modified after going through the dictionary. The modifications are clearly visible: in number 2, for example, "lol" becomes "laugh out loud".
Fixing Misspelled Words¶
We use the 'SpellChecker' class to automatically correct misspelled words in the 'Text' column. The corrected words are joined back into a sentence, and the corrected text is returned. This step is important for improving the accuracy and readability of the text.
from spellchecker import SpellChecker
spell = SpellChecker()
def correct_spelling(text):
    if text is None:
        return text
    corrected_words = []
    for word in text.split():
        # compute the correction once; it can be None for unknown words
        corrected = spell.correction(word)
        corrected_words.append(corrected if corrected is not None else word)
    return ' '.join(corrected_words)
combined_df['Text'] = combined_df['Text'].apply(correct_spelling)
combined_df.sample(5)
| Text | Text_Length | Num_Words | |
|---|---|---|---|
| 78263 | completed bangalore random activities would ce... | 320 | 51 |
| 37958 | bought based fairly decent reviews great price... | 576 | 109 |
| 4433 | good nights sleep good workout morningmaybe wo... | 84 | 14 |
| 12649 | holy hell got sims mail gonna go install im ho... | 132 | 26 |
| 47494 | product sony production carries epic logo able... | 365 | 66 |
Filtering the Valid English Words¶
Filtering to only valid English words is an important step; however, in our case this function takes extremely long to run. Due to time constraints and energy efficiency, it is commented out. Instead, during the Preprocessing steps we check whether the feature names are in the English vocabulary and filter them there.
# from nltk.corpus import words
# # nltk.download('words')
# english_vocab = set(words.words())  # build the vocabulary set once, not per word
# def is_english_word(word):
#     return word.lower() in english_vocab
# def remove_non_english(text):
#     tokens = word_tokenize(text)
#     english_tokens = [t for t in tokens if is_english_word(t) or not t.isalpha()]
#     return ' '.join(english_tokens)
# combined_df['Filtered_Text'] = combined_df['Text'].apply(remove_non_english)
After cleaning the data, we focus on the sentiment of the text. The EDA showed that all of the datasets already had sentiment labels for each row; however, two of the datasets contained only positive and negative values. Now, we determine the polarity of each row using 2 different tools:
- TextBlob
- VADER (with its component Sentiment Intensity Analyzer)
First, we start with TextBlob. TextBlob is a simple, easy-to-use library and a good choice for a first model. It provides polarity scores ranging from -1 (negative) to 1 (positive). It is not fine-grained, which can be limiting for more nuanced analyses.
def polarity(text):
return TextBlob(text).sentiment.polarity
combined_df['Polarity'] = combined_df['Text'].apply(polarity)
combined_df.sample(5)
| Text | Text_Length | Num_Words | Polarity | |
|---|---|---|---|---|
| 19424 | eating healthy snack toblerone | 36 | 6 | 0.500000 |
| 48861 | juicer less days already sent curb less months... | 765 | 152 | -0.026389 |
| 51030 | read entire narnian series twice junior high l... | 696 | 124 | 0.108571 |
| 76741 | party becomes religion leader becomes god crit... | 102 | 16 | 0.000000 |
| 80550 | pressure mounting mode takes metro delhi peopl... | 130 | 21 | -0.200000 |
The next step is to create a Sentiment column for the dataset based on the Polarity.
def sentiment(label):
    if label < 0:
        return "Negative"
    elif label == 0:
        return "Neutral"
    else:
        return "Positive"
combined_df['Sentiment'] = combined_df['Polarity'].apply(sentiment)
combined_df.sample(10)
| Text | Text_Length | Num_Words | Polarity | Sentiment | |
|---|---|---|---|---|---|
| 8578 | sad cousin fiancee leaving florida | 71 | 11 | -0.500000 | Negative |
| 23664 | wake ride | 53 | 12 | 0.000000 | Neutral |
| 84055 | holland nu anti national | 27 | 4 | 0.000000 | Neutral |
| 72765 | anything hindus works conservatives agenda nat... | 924 | 148 | 0.092262 | Positive |
| 81218 | bop got one thing right saying chart hate mein... | 93 | 17 | -0.257143 | Negative |
| 17803 | lost wallet last nite whole lotta thus never k... | 79 | 17 | 0.100000 | Positive |
| 14196 | agree disagree part wish fall sunny person | 105 | 21 | 0.000000 | Neutral |
| 22782 | ty kindness enjoy im doingmore enjoy benefit l... | 136 | 26 | 0.400000 | Positive |
| 16195 | thank your kind | 48 | 7 | 0.600000 | Positive |
| 44503 | really excited see cranberries compiled videos... | 272 | 48 | 0.334596 | Positive |
Secondly, we use VADER, which provides sentiment scores for positivity, negativity and neutrality. We use the Sentiment Intensity Analyzer, a component of VADER, though the two are typically referred to as the same tool. Along with those three scores, we also get a compound score, derived from the other three, that summarizes the polarity of the given text. Using VADER, we get more nuance on the polarity of a text rather than only a label.
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm
# nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
res = []
# tqdm (imported above) shows a progress bar over this fairly slow row-wise loop
for i, row in tqdm(combined_df.iterrows(), total=len(combined_df)):
    text = row['Text']
    sentiment_scores = sia.polarity_scores(text)
    res.append(sentiment_scores)
result_df = pd.DataFrame(res)
combined_df = pd.concat([combined_df, result_df], axis=1)
combined_df.sample(5)
| Text | Text_Length | Num_Words | Polarity | Sentiment | neg | neu | pos | compound | |
|---|---|---|---|---|---|---|---|---|---|
| 91071 | sorry comment | 14 | 2 | -0.500000 | Negative | 0.565 | 0.435 | 0.000 | -0.0772 |
| 54946 | used inverter several times already worked wel... | 819 | 154 | 0.153846 | Positive | 0.000 | 0.835 | 0.165 | 0.9118 |
| 28198 | yea figures hed work saturday graces day month... | 127 | 22 | 0.800000 | Positive | 0.000 | 0.536 | 0.464 | 0.7783 |
| 86660 | sentirsi chambre college carabinieri totalment... | 240 | 30 | 0.375000 | Positive | 0.000 | 0.942 | 0.058 | 0.2023 |
| 66138 | conflicting numbers can ibn regional channels ... | 81 | 13 | 0.000000 | Neutral | 0.252 | 0.748 | 0.000 | -0.4019 |
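As an aside, the compound score can be mapped back to a discrete label. A minimal sketch using the conventional ±0.05 cutoffs from the VADER documentation (the threshold is a convention, not something this notebook applies):

```python
def vader_label(compound, threshold=0.05):
    """Map a VADER compound score to a sentiment label.

    Scores within (-threshold, threshold) count as neutral.
    """
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

print(vader_label(0.9118))   # the strongly positive review from the sample above
print(vader_label(-0.0772))  # the slightly negative one
```

Applied to the sample above, the review with compound 0.9118 would be labelled Positive, and the one with -0.0772 Negative.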
After making all these changes and preparing the datasets, we check again for missing values, to see whether any Text values ended up empty after cleaning (for example, if they contained only symbols).
combined_df.isnull().sum()
Text 0 Text_Length 0 Num_Words 0 Polarity 0 Sentiment 0 neg 0 neu 0 pos 0 compound 0 dtype: int64
Since we decided to keep only the Text column from the original datasets, we did not have many options or much data to explore. After preparing the data, however, which included cleaning the text and determining the Polarity with TextBlob and VADER, we can revisit the Data Understanding. We now have not only the Text column, which is an object, but also numerical features to explore. Along with the statistical characteristics below, we also produce some visualizations to explore the words in the dataset and the distribution of the classes and sentiment scores.
Mean (Average)¶
The mean represents the average value of a set of data points. In this dataset, we can calculate the mean for numerical attributes like "Polarity", "neg", "pos", "neu", and "compound"; it provides a central measure of these values.
mean_polarity = combined_df['Polarity'].mean()
mean_polarity
0.09893055240952335
For the Polarity, the mean value indicates that, on average, the sentiments are slightly positive. This means that, in general, the text data contains more positive statements than negative ones.
mean_neg = combined_df['neg'].mean()
mean_neg
0.1255628261742272
The mean of the neg column suggests that, on average, the data has some degree of negative sentiment, however, it is close to 0.
mean_pos = combined_df['pos'].mean()
mean_pos
0.2220522805175555
The mean of the pos is higher than the negative column, indicating that, on average, the text has slightly more positive sentiment.
mean_neu = combined_df['neu'].mean()
mean_neu
0.6466333981821737
The mean of the neu column shows that a significant part of the dataset leans towards neutral sentiment.
mean_compound = combined_df['compound'].mean()
mean_compound
0.1979543237707027
The mean of the compound score is positive, leading to the conclusion that the text data in the dataset has a positive overall sentiment.
Mode¶
The mode is the most frequently occurring value in the dataset. We can find the mode for a categorical value like "Sentiment".
mode_sentiment = combined_df['Sentiment'].mode().iloc[0]
mode_sentiment
'Positive'
Standard deviation¶
The standard deviation measures how much data values vary from the mean. If the std is low, it means most of the data points are close to the average. If the standard deviation is high, it means the data points are more scattered and away from the average.
std_polarity = combined_df['Polarity'].std()
std_polarity
0.28825081057452745
std_neg = combined_df['neg'].std()
std_neg
0.17429167734060336
std_pos = combined_df['pos'].std()
std_pos
0.21527635026122413
std_neu = combined_df['neu'].std()
std_neu
0.2367692980584926
The standard deviations of Polarity, neg, pos, and neu suggest that the sentiment polarities/scores in the dataset are moderately spread out from the mean.
std_compound = combined_df['compound'].std()
std_compound
0.5355754928638831
On the other hand, the compound standard deviation suggests scores more widely spread from the average. This means the compound score has a significant degree of variability, indicating a wide range of sentiment expressions rather than being tightly clustered around a single sentiment value.
Distribution of the numerical values¶
plt.figure()
combined_df["Polarity"].plot.hist(bins=20, edgecolor='k', alpha=0.7)
plt.title('Polarity Distribution from TextBlob', fontsize=16)
plt.xlabel('Polarity')
plt.ylabel('Frequency')
plt.show()
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))
sns.histplot(combined_df["pos"], bins=20, color='green', kde=True, ax=axes[0])
axes[0].set_title('Positive Score')
axes[0].set_xlabel('Score')
axes[0].set_ylabel('Density')
sns.histplot(combined_df["neg"], bins=20, color='red', kde=True, ax=axes[1])
axes[1].set_title('Negative Score')
axes[1].set_xlabel('Score')
axes[1].set_ylabel('Density')
sns.histplot(combined_df["neu"], bins=20, color='blue', kde=True, ax=axes[2])
axes[2].set_title('Neutral score')
axes[2].set_xlabel('Score')
axes[2].set_ylabel('Density')
plt.suptitle('Score Distribution Analysis from VADER', fontsize=16)
plt.tight_layout()
plt.show()
plt.figure()
combined_df["compound"].plot.hist(bins=20, edgecolor='k', alpha=0.7)
plt.title('Compound Score Distribution from VADER', fontsize=16)
plt.xlabel('Compound')
plt.ylabel('Frequency')
plt.show()
Distribution of the Sentiment provided by TextBlob¶
Now, we are going to see the distribution of positive, negative and neutral text in our dataset, using countplot and pie plot.
custom_palette = ["#3498db", "#e74c3c", "#2ecc71"]
fig, ax = plt.subplots(figsize=(6, 4))
sns.countplot(x='Sentiment', data=combined_df, palette=custom_palette, ax=ax)
ax.set_title('Sentiment Distribution', fontsize=16)
ax.set_xlabel('Sentiment Category')
ax.set_ylabel('Count', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.show()
fig = plt.figure(figsize=(7,7))
colors = ["#2ecc71", "#3498db", "#e74c3c"]
wp = {'linewidth':2, 'edgecolor':"black"}
tags = combined_df['Sentiment'].value_counts()
explode = (0.1,0.1,0.1)
tags.plot(kind='pie', autopct='%1.1f%%', shadow=True, colors=colors,
          startangle=90, wedgeprops=wp, explode=explode, label='')
plt.title('Distribution of Sentiments', fontsize=16)
plt.show()  # without show(), the cell prints the Text object returned by plt.title
Text Exploration¶
all_words = ' '.join(combined_df['Text']).split()
word_counts = Counter(all_words)
common_words = word_counts.most_common(10)
print("Top 10 common words:", common_words)
Top 10 common words: [('book', 15360), ('one', 15354), ('like', 14794), ('good', 11741), ('would', 10647), ('get', 10085), ('time', 8688), ('people', 8202), ('great', 7914), ('even', 7241)]
Now, we look at the most common words for each sentiment specifically. We re-define the helper function here for readability, since the earlier version lives in the Data Understanding chapter.
def get_common_words_combined(dataframe, sentiment):
    subset = dataframe[dataframe['Sentiment'] == sentiment]
    # dropna avoids joining missing values into the word list
    all_words = ' '.join(subset['Text'].dropna()).lower().split()
    word_counts = Counter(all_words)
    return word_counts.most_common(10)
common_words_positive = get_common_words_combined(combined_df, 'Positive')
common_words_neutral = get_common_words_combined(combined_df, 'Neutral')
common_words_negative = get_common_words_combined(combined_df, 'Negative')
print("Top 10 common words for Positive Sentiment:", common_words_positive)
print("Top 10 common words for Neutral Sentiment:", common_words_neutral)
print("Top 10 common words for Negative Sentiment:", common_words_negative)
Top 10 common words for Positive Sentiment: [('book', 11877), ('one', 10916), ('good', 10803), ('like', 9825), ('great', 7570), ('would', 7550), ('get', 6440), ('time', 5768), ('really', 5629), ('much', 5300)]
Top 10 common words for Neutral Sentiment: [('like', 1307), ('get', 1090), ('one', 1001), ('im', 912), ('mode', 883), ('time', 783), ('bop', 749), ('india', 680), ('going', 670), ('back', 642)]
Top 10 common words for Negative Sentiment: [('like', 3662), ('one', 3437), ('book', 3220), ('would', 2557), ('get', 2555), ('people', 2421), ('bad', 2181), ('even', 2149), ('time', 2137), ('dont', 1680)]
combined_df['Num_Words'] = combined_df['Text'].apply(lambda x: len(str(x).split()))
print("Average number of words per text entry:", combined_df['Num_Words'].mean())
Average number of words per text entry: 21.121277625091356
common_word_lengths = {word: len(word) for word, _ in common_words}
print("Length of words used often:", common_word_lengths)
Length of words used often: {'book': 4, 'one': 3, 'like': 4, 'good': 4, 'would': 5, 'get': 3, 'time': 4, 'people': 6, 'great': 5, 'even': 4}
sentiments = combined_df['Sentiment'].unique()
fig, (ax1, ax2, ax3) = plt.subplots(nrows=len(sentiments), sharex=True, figsize=(7, 5))
for i, sentiment in enumerate(sentiments):
subset = combined_df[combined_df['Sentiment'] == sentiment]
color = 'red' if sentiment == 'Negative' else ('green' if sentiment == 'Positive' else 'blue')
ax = ax1 if sentiment == 'Negative' else (ax2 if sentiment == 'Neutral' else ax3)
ax.scatter(subset['Num_Words'], subset.index, color=color, label=f'Sentiment {sentiment}', alpha=0.5)
ax.set_title(f'Number of Words vs. Sentiment (Sentiment {sentiment})')
ax.set_xlabel('Number of Words')
ax.set_ylabel('Index')
ax.legend()
plt.tight_layout()
plt.show()
In the next code snippets, we create three word clouds to display the most used words in the positive, negative and neutral texts.
positive_text = combined_df[combined_df['Sentiment'] == 'Positive']['Text']
negative_text = combined_df[combined_df['Sentiment'] == 'Negative']['Text']
neutral_text = combined_df[combined_df['Sentiment'] == 'Neutral']['Text']
def generate_and_plot_wordcloud(text, title, ax):
wordcloud = WordCloud(width=400, height=200, random_state=21, max_font_size=119, background_color='white').generate(text)
ax.imshow(wordcloud, interpolation="bilinear")
ax.axis('off')
ax.set_title(title, fontsize=12)
fig, axs = plt.subplots(1, 3, figsize=(17, 5))
generate_and_plot_wordcloud(' '.join(positive_text), 'Positive Sentiment', axs[0])
generate_and_plot_wordcloud(' '.join(negative_text), 'Negative Sentiment', axs[1])
generate_and_plot_wordcloud(' '.join(neutral_text), 'Neutral Sentiment', axs[2])
plt.suptitle('Word Clouds for Different Sentiments', fontsize=16)
plt.tight_layout()
plt.show()
🔮 Predictions¶
Now, we move to Phase 3 of the AI Project Methodology. We preprocess the data so that a computer can read it, feed it into one or more models, and evaluate the results using classification reports as well as new input data.
🧹 Preprocessing¶
In this initial step of the data preprocessing, we focus on enhancing the quality of the text data by employing tokenization. This is the process of splitting text into individual words or tokens, making it easier to analyze and extract features from. This step is crucial for understanding the structure of the text and for a task such as sentiment analysis.
from nltk.tokenize import word_tokenize
nltk.download('punkt')
combined_df['Text'] = combined_df['Text'].apply(lambda text: word_tokenize(text))
[nltk_data] Downloading package punkt to [nltk_data] C:\Users\Anna\AppData\Roaming\nltk_data... [nltk_data] Package punkt is already up-to-date!
In this second step of data preprocessing, we handle the tokens obtained through tokenization. After tokenization, the text is divided into its constituent words, making it more suitable for further analysis. To optimize the data structure and maintain the integrity of the text, we use the 'join' function to reassemble these tokens into coherent sentences.
combined_df['Text'] = combined_df['Text'].apply(lambda tokens: ' '.join(tokens))
combined_df.sample(10)
| Text | Text_Length | Num_Words | Polarity | Sentiment | neg | neu | pos | compound | |
|---|---|---|---|---|---|---|---|---|---|
| 82234 | mistakes shaw whole game till date started cel... | 720 | 80 | 0.210000 | Positive | 0.125 | 0.620 | 0.255 | 0.9112 |
| 74925 | newsreader struggling much speak inherently di... | 71 | 7 | -0.150000 | Negative | 0.515 | 0.485 | 0.000 | -0.6486 |
| 2877 | mom hair dried shes using razor shes pulling s... | 104 | 9 | -0.200000 | Negative | 0.000 | 1.000 | 0.000 | 0.0000 |
| 48565 | willie nelson redheaded stranger released howe... | 116 | 11 | 0.600000 | Positive | 0.000 | 0.735 | 0.265 | 0.5574 |
| 91322 | disgraceful | 12 | 1 | 0.000000 | Neutral | 0.000 | 1.000 | 0.000 | 0.0000 |
| 45526 | looking something soothe aching legs frequent ... | 156 | 13 | 0.050000 | Positive | 0.324 | 0.541 | 0.135 | -0.5423 |
| 57961 | flat feet used excruciating foot pain every da... | 657 | 64 | -0.029167 | Negative | 0.275 | 0.575 | 0.149 | -0.9246 |
| 31326 | besides fact love nin ive waiting year release... | 359 | 31 | 0.330272 | Positive | 0.060 | 0.603 | 0.336 | 0.9169 |
| 79532 | drug congratulations incredible work youtube t... | 279 | 26 | 0.367857 | Positive | 0.000 | 0.556 | 0.444 | 0.9501 |
| 58758 | fun moviegreat similar groundhog day watch mai... | 241 | 22 | 0.295833 | Positive | 0.000 | 0.531 | 0.469 | 0.9325 |
For this project, I decided to try 2 types of vectorization techniques. The first one is the 'CountVectorizer'. It is a tool that's like a magic wand for turning words into numbers. When we analyze text, computers prefer numbers, so this 'CountVectorizer' helps us convert the words in our 'Text' data into a numerical format. The 'ngram_range=(1,2)' part tells the CountVectorizer to consider both single words and pairs of words, which can provide more context. Once we use this 'CountVectorizer,' we'll have a numerical representation of our text that makes it easier for the computer to understand and analyze.
The other option is the 'TfidfVectorizer'. Both are methods for converting text data into vectors; however, the TfidfVectorizer also considers the overall document weightage of a word, scaling counts by how often words appear across the documents, whereas the CountVectorizer simply counts the number of times a word appears in a document.
Both of these methods have their pros and cons and that is why I will train the models once with both of them and then make a conclusion on which one to use.
# from sklearn.feature_extraction.text import TfidfVectorizer
# vect = TfidfVectorizer(ngram_range=(1, 2)).fit(combined_df['Text'])
vect = CountVectorizer(ngram_range=(1,2)).fit(combined_df['Text'])
'feature_names' is like a list of the unique words and word pairs found in our text. First, we're checking how many of these unique features there are. Then, we're showing the first 20 of these features to get an idea of what words or combinations the computer will use for analysis. It helps us understand how the computer sees our text.
feature_names = vect.get_feature_names_out()
keys = vect.vocabulary_.keys()
print("Number of features: {}\n".format(len(feature_names)))
print("First 20 features:\n {}".format(feature_names[:20]))
print(list(keys)[:20])
Number of features: 1380263 First 20 features: ['aabaasa' 'aabaasa ak' 'aacsrunning' 'aacsrunning time' 'aadhaar' 'aadhaar act' 'aadhaar afraid' 'aadhaar anything' 'aadhaar app' 'aadhaar authentication' 'aadhaar bank' 'aadhaar billion' 'aadhaar biometric' 'aadhaar but' 'aadhaar card' 'aadhaar could' 'aadhaar current' 'aadhaar data' 'aadhaar dataprotection' 'aadhaar digit'] ['im', 'thinking', 'gon', 'na', 'myspace', 'bit', 'checking', 'im thinking', 'thinking im', 'im gon', 'gon na', 'na myspace', 'myspace bit', 'bit checking', 'chilling', 'emily', 'lol', 'then', 'tomorrow', 'run']
Looking more closely at the features of the CountVectorizer, the 'aa' sequences present in our first 20 features are not regular English words: they are non-dictionary tokens such as proper nouns (e.g. 'aadhaar') and misspellings left in the text. To address this, we go through feature_names and check whether each word appears in the English vocabulary: we import 'words' from nltk and use it to filter feature_names.
nltk.download('words')
from nltk.corpus import words
valid_words = set(words.words())
# note: this filters the list we display below; the fitted vectorizer's
# vocabulary (and therefore the transformed feature matrix) is unchanged
feature_names = [feature for feature in feature_names if feature in valid_words]
print("Number of features: {}\n".format(len(feature_names)))
print("First 20 features:\n {}".format(feature_names[:20]))
[nltk_data] Downloading package words to [nltk_data] C:\Users\Anna\AppData\Roaming\nltk_data... [nltk_data] Package words is already up-to-date!
Number of features: 23753 First 20 features: ['aback', 'abacus', 'abalone', 'abandon', 'abandoned', 'abandonment', 'abase', 'abased', 'abatement', 'abattoir', 'abbas', 'abbey', 'abbot', 'abbreviation', 'abdicate', 'abdominal', 'abduct', 'abduction', 'abed', 'aberrant']
🪓 Splitting into train/test¶
Now, we're preparing our text data for analysis. 'X' represents our text, while 'Y' represents the corresponding sentiment labels. We use 'vect.transform(X)' to convert our text into a format that the computer can understand for analysis.
X = combined_df['Text']
Y = combined_df['Sentiment']
X = vect.transform(X)
Now, we're splitting our prepared data into two parts: 'x_train' and 'x_test' represent the portions of our data used for training and testing, while 'y_train' and 'y_test' represent the corresponding sentiment labels. By doing this, we're setting aside a portion of our data (20% in this case) for testing the performance of our analysis later. The 'random_state=42' ensures that the data split is consistent every time we run this code.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print("Size of x_train:", (x_train.shape))
print("Size of y_train:", (y_train.shape))
print("Size of x_test:", (x_test.shape))
print("Size of y_test:", (y_test.shape))
Size of x_train: (77719, 1380263) Size of y_train: (77719,) Size of x_test: (19430, 1380263) Size of y_test: (19430,)
🧬 Modelling¶
The next step is modelling. We are going to create a few base models and then an ensemble model to get better predictions. The base models are going to be: Logistic Regression, Naive Bayes, SVM and Random Forest.
import warnings
warnings.filterwarnings('ignore')
In this section, we're using a machine learning algorithm called Logistic Regression to build a model. We train the model with our training data ('x_train' and 'y_train') to help it learn patterns in the text data and their corresponding sentiment labels. Then, we use this trained model to make predictions on our test data ('x_test') and compare those predictions with the actual sentiment labels ('y_test'). We choose Logistic Regression because it's a simple and efficient method for understanding and predicting sentiment in text.
logreg = LogisticRegression(C=100, penalty='l2')
logreg.fit(x_train, y_train)
LogisticRegression(C=100)
Now, we employ the 'GridSearchCV' technique to optimize the hyperparameters of the Logistic Regression, as we will do for all the other models as well. This process is essential for fine-tuning the model's performance. All the grid searches were already executed; however, they are now declared inside functions because of the long time they take to run. The results for the Logistic Regression indicate that the most suitable hyperparameters are a regularization parameter 'C' of 100 and the 'l2' penalty.
- 'C' controls the degree of regularization applied to the model. A smaller 'C' value increases the regularization stength, leading to simpler model. On the other hand, a larger 'C' value reduces regularization allowing the model to fit more closely the training data.
- The 'penalty' specifies the type of regularization to be used. In simple words, 'l1' penalty (Lasso) focuses on a small number of features and ignores the rest, while the 'l2' penalty (Ridge) balances the model more by including all features.
The accuracy achieved using these optimal parameters was an impressive 92%.
from sklearn.model_selection import GridSearchCV
def logistic_regression_grid_search(x_train, y_train):
param_grid_lr = {
'C': [0.001, 0.01, 0.1, 1, 10, 100],
'penalty': ['l1', 'l2'],
}
logistic_regression = LogisticRegression()
grid_search_lr = GridSearchCV(logistic_regression, param_grid_lr, cv=5, scoring='accuracy')
grid_search_lr.fit(x_train, y_train)
best_params_lr = grid_search_lr.best_params_
best_score_lr = grid_search_lr.best_score_
return best_params_lr, best_score_lr
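The grid-search helpers above take a long time on the full 77k-document matrix. As a minimal sketch of the same call pattern, the search can be exercised on a tiny synthetic dataset (the dataset and the reduced grid below are illustrative assumptions, not the project's actual data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tiny synthetic stand-in for the vectorized review matrix (3 classes).
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     n_informative=5, n_classes=3,
                                     random_state=42)

# Reduced grid; note that 'l1' would additionally require a compatible
# solver such as 'liblinear' or 'saga', so only 'l2' is searched here.
param_grid_lr = {'C': [0.01, 1, 100], 'penalty': ['l2']}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid_lr,
                    cv=5, scoring='accuracy')
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
```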
Afterwards, we are employing a machine learning algorithm known as Multinomial Naive Bayes. Multinomial Naive Bayes is a suitable choice for our task as it is well-suited for text classification and has been successful in various natural language processing applications due to its simplicity and efficiency.
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB(alpha=0.5)
naive_bayes.fit(x_train, y_train)
MultinomialNB(alpha=0.5)
Now, we are optimizing the hyperparameters of the Naive Bayes model. The results reveal that the best hyperparameter is an 'alpha' value of 0.5, which leads to an accuracy of 65%. The 'alpha' parameter in Naive Bayes controls the smoothing applied to the model; an 'alpha' of 0.5 suggests moderate smoothing, which helps the model generalize better from the training data.
def naive_bayes_grid_search(x_train, y_train):
param_grid_nb = {
'alpha': [0.1, 0.5, 1.0, 1.5, 2.0],
}
naive_bayes = MultinomialNB()
grid_search_nb = GridSearchCV(naive_bayes, param_grid_nb, cv=5, scoring='accuracy')
grid_search_nb.fit(x_train, y_train)
best_params_nb = grid_search_nb.best_params_
best_score_nb = grid_search_nb.best_score_
return best_params_nb, best_score_nb
Next, we are using a Support Vector Machine (SVM) with a linear kernel. SVMs are chosen for their ability to create an effective linear boundary for separating sentiment classes in text data, making them efficient and interpretable. The linear kernel is used because it is suitable when the data is expected to be linearly separable, a common characteristic in text sentiment analysis where words or features can be effectively separated by a straight line.
from sklearn.svm import SVC
svm = SVC(C=1, kernel='linear', probability=True)
svm.fit(x_train, y_train)
SVC(C=1, kernel='linear', probability=True)
We are continuing to optimize our models, and next in line is the Support Vector Machine. The results of the SVM model indicate that the best hyperparameters are a 'C' value set to 1 and the 'kernel' type set as 'linear'.
- The 'C' parameter determines the trade-off between maximizing the margin and minimizing the classification error. A smaller 'C' value emphasizes a wider margin, while a larger 'C' value focuses on correctly classifying each training point.
- The kernel is responsible for transforming the input data into a higher-dimensional space. The 'linear' kernel is a suitable choice for us, because the data can be separated effectively by a linear decision boundary.
This optimal configuration results in a 92% accuracy.
def svc_grid_search(x_train, y_train):
param_grid_svc = {
'C': [0.1, 1, 10],
'kernel': ['linear', 'rbf'],
}
svc = SVC()
grid_search_svc = GridSearchCV(svc, param_grid_svc, cv=5, scoring='accuracy')
grid_search_svc.fit(x_train, y_train)
best_params_svc = grid_search_svc.best_params_
best_score_svc = grid_search_svc.best_score_
return best_params_svc, best_score_svc
Lastly, we are utilizing a Random Forest classifier. Random Forest is chosen because it's a versatile ensemble learning method that combines multiple decision trees, making it capable of capturing complex relationships in text data. The 'n_estimators' parameter is set to 200, meaning we're using 200 decision trees in the forest for robust predictive performance.
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=42)
random_forest.fit(x_train, y_train)
RandomForestClassifier(n_estimators=200, random_state=42)
Now, we are going to optimize the Random Forest classifier. The results suggest an 'n_estimators' of 200 and a 'max_depth' of None.
- The 'n_estimators' parameter specifies the number of trees in the Random Forest ensemble. A higher number of trees can provide better generalization but might increase computational complexity.
- The 'max_depth' controls the maximum depth of the trees within the Random Forest. A deeper tree may capture more intricate patterns in the data but might lead to overfitting, while a shallower tree may prevent the overfitting but not capture that much detail.
Using these hyperparameter values the accuracy achieved is 83%.
def random_forest_grid_search(x_train, y_train):
param_grid_rf = {
'n_estimators': [100, 200, 300],
'max_depth': [None, 10, 20, 30],
}
rf_classifier = RandomForestClassifier()
grid_search_rf = GridSearchCV(rf_classifier, param_grid_rf, cv=5, scoring='accuracy')
grid_search_rf.fit(x_train, y_train)
best_params_rf = grid_search_rf.best_params_
best_score_rf = grid_search_rf.best_score_
return best_params_rf, best_score_rf
🧐 Evaluation¶
Now, we are going to evaluate the performance of each model by looking at common evaluation metrics such as accuracy. Furthermore, for each model, we are going to print a confusion matrix and classification report to evaluate the precision, recall, F1-score and support. Moreover, we are going to test the predictions of the ensemble model by introducing it to 3 new reviews - 1 positive, 1 negative and 1 neutral.
logreg_pred = logreg.predict(x_test)
logreg_acc = accuracy_score(logreg_pred, y_test)
print("Test accuracy: {:.2f}%".format(logreg_acc*100))
Test accuracy: 93.87%
print(confusion_matrix(y_test, logreg_pred))
print("\n")
print(classification_report(y_test, logreg_pred))
[[3703 173 388]
[ 70 4907 71]
[ 304 186 9628]]
precision recall f1-score support
Negative 0.91 0.87 0.89 4264
Neutral 0.93 0.97 0.95 5048
Positive 0.95 0.95 0.95 10118
accuracy 0.94 19430
macro avg 0.93 0.93 0.93 19430
weighted avg 0.94 0.94 0.94 19430
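As a sanity check, the 93.87% test accuracy can be recomputed directly from the printed confusion matrix, since the correct predictions sit on the diagonal:

```python
import numpy as np

# Confusion matrix printed above for the Logistic Regression model.
cm = np.array([[3703,  173,  388],
               [  70, 4907,   71],
               [ 304,  186, 9628]])

# Accuracy = correct predictions (diagonal) / all predictions.
accuracy = np.trace(cm) / cm.sum()
print(f"{accuracy:.2%}")  # 93.87%
```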
The 'confusion_matrix' function compares our model's predictions with the actual sentiment labels in our test data. It tells us how many times our model got it right and how many times it made mistakes.
style.use('classic')
cm = confusion_matrix(y_test, logreg_pred, labels=logreg.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels=logreg.classes_)
disp.plot()
The results of the Logistic Regression show a solid performance in sentiment analysis: 92% accuracy during the grid search and 93.87% on the larger held-out test set. The model exhibits strong precision, recall, and F1-score values for each sentiment class, indicating its ability to classify sentiments effectively, with particularly high performance in the Neutral and Positive categories.
naive_bayes_pred = naive_bayes.predict(x_test)
naive_bayes_acc = accuracy_score(naive_bayes_pred, y_test)
print("Naive Bayes Test accuracy: {:.2f}%".format(naive_bayes_acc * 100))
Naive Bayes Test accuracy: 66.70%
print(confusion_matrix(y_test, naive_bayes_pred))
print("\n")
print(classification_report(y_test, naive_bayes_pred))
[[3033 136 1095]
[ 922 2004 2122]
[1875 321 7922]]
precision recall f1-score support
Negative 0.52 0.71 0.60 4264
Neutral 0.81 0.40 0.53 5048
Positive 0.71 0.78 0.75 10118
accuracy 0.67 19430
macro avg 0.68 0.63 0.63 19430
weighted avg 0.70 0.67 0.66 19430
style.use('classic')
cm = confusion_matrix(y_test, naive_bayes_pred, labels=naive_bayes.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels=naive_bayes.classes_)
disp.plot()
In summary, the results from the Naive Bayes model indicate a less favorable performance in sentiment analysis: 65% accuracy during the grid search and 66.70% on the larger held-out test set. The model shows lower precision and recall values, especially for the Neutral class, suggesting challenges in accurately classifying that sentiment. While it demonstrates relatively higher recall for the Positive class, its overall performance lags behind the other models.
svm_pred = svm.predict(x_test)
svm_acc = accuracy_score(svm_pred, y_test)
print("SVM Test accuracy: {:.2f}%".format(svm_acc * 100))
SVM Test accuracy: 93.60%
print(confusion_matrix(y_test, svm_pred))
print("\n")
print(classification_report(y_test, svm_pred))
[[3696 141 427]
[ 60 4931 57]
[ 395 163 9560]]
precision recall f1-score support
Negative 0.89 0.87 0.88 4264
Neutral 0.94 0.98 0.96 5048
Positive 0.95 0.94 0.95 10118
accuracy 0.94 19430
macro avg 0.93 0.93 0.93 19430
weighted avg 0.94 0.94 0.94 19430
style.use('classic')
cm = confusion_matrix(y_test, svm_pred, labels=svm.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels=svm.classes_)
disp.plot()
These results illustrate a strong performance in sentiment analysis, with an impressive accuracy of 93%. The SVM model exhibits high precision, recall, and F1-score values for each sentiment class, indicating its ability to effectively and consistently classify sentiments, with robust performance in the Neutral category.
random_forest_pred = random_forest.predict(x_test)
random_forest_acc = accuracy_score(random_forest_pred, y_test)
print("Random Forest Test accuracy: {:.2f}%".format(random_forest_acc * 100))
Random Forest Test accuracy: 83.71%
print(confusion_matrix(y_test, random_forest_pred))
print("\n")
print(classification_report(y_test, random_forest_pred))
[[1676 570 2018]
[ 16 4872 160]
[ 71 331 9716]]
precision recall f1-score support
Negative 0.95 0.39 0.56 4264
Neutral 0.84 0.97 0.90 5048
Positive 0.82 0.96 0.88 10118
accuracy 0.84 19430
macro avg 0.87 0.77 0.78 19430
weighted avg 0.85 0.84 0.82 19430
style.use('classic')
cm = confusion_matrix(y_test, random_forest_pred, labels=random_forest.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels=random_forest.classes_)
disp.plot()
For the last one of our base models, the results indicate a mixed performance in sentiment analysis, with an overall accuracy of 83%. The model demonstrates a high recall for the Neutral class. However, its precision is notably lower for the Negative sentiment. Overall, the model exhibits balanced but not exceptional performance.
All our base models have their pros and cons, as well as weak spots and strong ones. However, one model stands out with its low accuracy and poor classification report: Naive Bayes, at 65% (66.70% on the test set). Several factors contribute to this performance. Firstly, Naive Bayes is inherently based on a "naive" assumption that each feature is independent of the others, which is often untrue for natural language data, where words frequently have intricate dependencies. Additionally, Naive Bayes relies on a bag-of-words representation, which means it does not consider the order of the words in a sentence. This can result in a loss of valuable information, especially in sentiment analysis. Models like Support Vector Machines, Random Forest, or Logistic Regression are better at capturing these nuances.
Nevertheless, I decided to include the Naive Bayes model in the ensemble. Ensemble methods work best when the base models are diverse: if Naive Bayes makes different errors compared to the other models, it could complement the ensemble. This decision has its challenges, but for now Naive Bayes is going to be part of the ensemble.
Now we need to combine all these base models to create an ensemble model with a Voting Classifier to make a final decision ('hard' voting) on sentiment analysis. The Voting Classifier leverages the strengths of these individual models to potentially improve overall prediction accuracy and generalization. It's then fitted with training data for sentiment analysis.
from sklearn.ensemble import VotingClassifier
voting_classifier = VotingClassifier(estimators=[
('logistic_regression', logreg),
# ('naive_bayes', naive_bayes),
('svm', svm),
('random_forest', random_forest)
], voting='hard')
voting_classifier.fit(x_train, y_train)
with open('voting_classifier.pkl', 'wb') as file:
pickle.dump(voting_classifier, file)
ensemble_pred = voting_classifier.predict(x_test)
ensemble_acc = accuracy_score(ensemble_pred, y_test)
print("Ensemble Test accuracy: {:.2f}%".format(ensemble_acc * 100))
Ensemble Test accuracy: 92.59%
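Hard voting simply tallies the labels cast by the base models and picks the majority. A minimal sketch of that tally, with hypothetical votes for a single review:

```python
from collections import Counter

# Hypothetical labels cast by the three ensemble members for one review.
votes = ['Negative', 'Negative', 'Positive']  # logreg, svm, random forest

# The majority label wins (scikit-learn breaks exact ties by class order).
winner = Counter(votes).most_common(1)[0][0]
print(winner)  # Negative
```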
print(confusion_matrix(y_test, ensemble_pred))
print("\n")
print(classification_report(y_test, ensemble_pred))
[[2526 144 283]
[ 49 3631 49]
[ 256 184 5908]]
precision recall f1-score support
Negative 0.89 0.86 0.87 2953
Neutral 0.92 0.97 0.94 3729
Positive 0.95 0.93 0.94 6348
accuracy 0.93 13030
macro avg 0.92 0.92 0.92 13030
weighted avg 0.93 0.93 0.93 13030
style.use('classic')
cm = confusion_matrix(y_test, ensemble_pred, labels=voting_classifier.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=voting_classifier.classes_)
disp.plot()
In conclusion, the ensemble model showcases strong performance in sentiment analysis, with an impressive accuracy of 92% with the CountVectorizer. It demonstrates robust precision, recall, and F1-score values for each sentiment category, indicating its capacity to effectively classify sentiments.
For the last part of the evaluation, I introduced the model to 3 new reviews - 1 positive, 1 negative and 1 neutral. The model successfully classifies all of them.
negative_test = "I recently bought shoes and they were a disaster. I hate them! They are making my feet hurt during the workout. I would never repurchase them or recommend them."
positive_test = "I recently got nike shoes, and it's been a game-changer for my workouts. It boasts outstanding durability, excellent performance, and superior comfort. It's a must-have for fitness enthusiasts."
neutral_test = "It is meh. I do not have an opinion."
neg_text_vector = vect.transform([negative_test])
pos_text_vector = vect.transform([positive_test])
neu_text_vector = vect.transform([neutral_test])
prediction_test = voting_classifier.predict(neg_text_vector)
prediction2_test = voting_classifier.predict(pos_text_vector)
prediction3_test = voting_classifier.predict(neu_text_vector)
print("Negative text prediction: " + prediction_test[0] + ". Positive text prediction: " + prediction2_test[0] + ". Neutral text prediction: " + prediction3_test[0] + ".")
Negative text prediction: Negative. Positive text prediction: Positive. Neutral text prediction: Neutral.
As I already mentioned, I trained the models with 2 types of vectorizers. However, for this project, I decided to continue with the CountVectorizer, as it led to more promising results and evaluation metrics. Nevertheless, I trained the models with the TfidfVectorizer as well and will save them as pickle files.
🥒 Saving the Models Using Pickle¶
# saving the abbreviation dictionary
with open('abbreviations.json', 'w') as file:
json.dump(abbreviations_dictionary, file)
# saving the vectorizer
with open('vect.pkl', 'wb') as file:
pickle.dump(vect, file)
# saving the Logistic Regression model
with open('logreg.pkl', 'wb') as file:
pickle.dump(logreg, file)
# saving the Naive Bayes model
with open('naive_bayes.pkl', 'wb') as file:
pickle.dump(naive_bayes, file)
# saving the Support Vector Machines model
with open('svm.pkl', 'wb') as file:
pickle.dump(svm, file)
# saving the Random Forest model
with open('random_forest.pkl', 'wb') as file:
pickle.dump(random_forest, file)
# saving the Voting Classifier (ensemble model)
with open('voting_classifier.pkl', 'wb') as file:
pickle.dump(voting_classifier, file)
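The backend will reload these files with `pickle.load`. A self-contained round-trip sketch (using a tiny illustrative vectorizer and model, not the ones saved above) shows that predictions survive the save/load cycle:

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Fit a tiny demo vectorizer and model on a trivially separable corpus.
texts = ["love these shoes", "hate these shoes", "love it", "hate it"]
labels = ["Positive", "Negative", "Positive", "Negative"]
vect_demo = CountVectorizer().fit(texts)
model_demo = LogisticRegression().fit(vect_demo.transform(texts), labels)

# Save both artifacts the same way the notebook does.
with open('demo_vect.pkl', 'wb') as f:
    pickle.dump(vect_demo, f)
with open('demo_model.pkl', 'wb') as f:
    pickle.dump(model_demo, f)

# Reload and predict with the restored objects.
with open('demo_vect.pkl', 'rb') as f:
    vect_loaded = pickle.load(f)
with open('demo_model.pkl', 'rb') as f:
    model_loaded = pickle.load(f)

print(model_loaded.predict(vect_loaded.transform(["love these shoes"])))
```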
🔧 Functions - Text Prep and Analysis¶
This chapter is dedicated to condensing all of the above code into functions that carry out the complete sentiment analysis task, as well as adding a new function that predicts the percentages for each sentiment. The functions use the already saved models and vectorizer.
Text Preparation and Preprocessing¶
The first step is the creation of a simple function that preprocesses the text and prepares it for the analysis.
def clean_and_preprocess_text(text):
# lowering the text, removing hashtags,URLs, non-alphanumeric characters, repeating characters, single repeating characters
text = text.lower()
text = re.sub(r'#', '', text)
text = re.sub(r'https?:\/\/\S+', '', text)
text = re.sub(r'[^a-zA-Z0-9\s]|(\d+)', '', text)
text = re.sub(r'\b\w*(\w)\1{2,}\w*\b', '', text)
# removing stopwords
text = ' '.join([word for word in text.split() if word not in stop_words])
# Getting expansion from dictionary
text = ' '.join([get_expansion_from_dictionary(word) for word in text.split()])
# Correcting spelling
text = correct_spelling(text)
# Tokenizing and join
text = ' '.join(word_tokenize(text))
return text
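The regex stages of this function can be illustrated in isolation (the stop-word, abbreviation-expansion, and spell-correction stages rely on objects defined earlier in the notebook, so they are omitted from this sketch):

```python
import re

# Sample input exercising each regex stage of clean_and_preprocess_text.
text = "Check this out!!! #sale https://example.com BEST shoessss 100%"
text = text.lower()
text = re.sub(r'#', '', text)                     # drop hashtag symbols
text = re.sub(r'https?:\/\/\S+', '', text)        # drop URLs
text = re.sub(r'[^a-zA-Z0-9\s]|(\d+)', '', text)  # drop punctuation and digits
text = re.sub(r'\b\w*(\w)\1{2,}\w*\b', '', text)  # drop words with 3+ repeated chars
cleaned = ' '.join(text.split())
print(cleaned)  # check this out sale best
```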
Sentiment Analysis Task¶
The next step is to create a function that actually performs the sentiment analysis task. In order to get the percentage of how Positive, Negative, and Neutral a text is, I explored the predict_proba and decision_function methods. The function is quite long, so let's break it down: it has 2 sections, depending on whether the supplied model is the voting classifier or one of the other models.
- The first part of the function for the Voting Classifier
At the beginning of this part of the function, we check whether the provided model is a Voting Classifier by verifying the existence of 'estimators_' and 'classes_'. Afterwards, we initialize the variables that store our results. Since the Voting Classifier with hard voting does not support predict_proba, we iterate through all of the base estimators, using either predict_proba or decision_function depending on the estimator. Lastly, we calculate the percentages, take the average percentage for each sentiment, and record both the majority sentiment and the sentiment predicted by the voting classifier itself.
- The second part of the function for one model
This part handles the case when the provided model is not a Voting Classifier: we simply get the results from the single model, without the averaging calculations of the previous section.
This is the final function for the sentiment analysis, and it is going to be used in the backend. An important decision to note here concerns the average percentages for each class: every model contributes to the final percentage for each sentiment. With this in mind, I went back and trained the ensemble model without the Naive Bayes, as it was skewing the results. I was led to this decision after multiple tests with different reviews.
def analyze_sentiment(raw_text, preprocess_text, vect, model):
text_vector = vect.transform([preprocess_text])
if hasattr(model, 'estimators_') and hasattr(model, 'classes_'):
# If it's the VotingClassifier
result = {'raw': raw_text, 'preprocessed': preprocess_text, 'base_estimators': {}}
all_positive_percentages = []
all_neutral_percentages = []
all_negative_percentages = []
all_predicted_sentiments = []
for estimator_name, base_estimator in model.named_estimators.items():
if hasattr(base_estimator, 'predict_proba'):
predicted_sentiment_probs = base_estimator.predict_proba(text_vector)[0]
else:
predicted_sentiment_probs = base_estimator.decision_function(text_vector)[0]
negative_percent = predicted_sentiment_probs[0] * 100
neutral_percent = predicted_sentiment_probs[1] * 100
positive_percent = predicted_sentiment_probs[2] * 100
all_positive_percentages.append(positive_percent)
all_neutral_percentages.append(neutral_percent)
all_negative_percentages.append(negative_percent)
predicted_sentiment_label = base_estimator.classes_[predicted_sentiment_probs.argmax()]
predicted_sentiment_label_model = base_estimator.predict(text_vector)
base_result = {
"probability": predicted_sentiment_label,
"sentiment": predicted_sentiment_label_model[0],
"positive": f"{positive_percent:.2f}%",
"neutral": f"{neutral_percent:.2f}%",
"negative": f"{negative_percent:.2f}%"
}
result['base_estimators'][estimator_name] = base_result
all_predicted_sentiments.append(predicted_sentiment_label_model[0])
# average percentages
avg_positive_percent = sum(all_positive_percentages) / len(all_positive_percentages)
avg_neutral_percent = sum(all_neutral_percentages) / len(all_neutral_percentages)
avg_negative_percent = sum(all_negative_percentages) / len(all_negative_percentages)
majority_sentiment = max(set(all_predicted_sentiments), key=all_predicted_sentiments.count)
voting_sentiment = model.predict(text_vector)
result["positive"] = f"{avg_positive_percent:.2f}%"
result["neutral"] = f"{avg_neutral_percent:.2f}%"
result["negative"] = f"{avg_negative_percent:.2f}%"
result["majority_sentiment"] = majority_sentiment
result["voting_sentiment"] = voting_sentiment[0]
else:
# If it's one of the other models
if hasattr(model, 'predict_proba'):
predicted_sentiment_probs = model.predict_proba(text_vector)[0]
else:
predicted_sentiment_probs = model.decision_function(text_vector)[0]
negative_percent = predicted_sentiment_probs[0] * 100
neutral_percent = predicted_sentiment_probs[1] * 100
positive_percent = predicted_sentiment_probs[2] * 100
predicted_sentiment_label = model.classes_[predicted_sentiment_probs.argmax()]
predicted_sentiment_label_model = model.predict(text_vector)
result = {
"raw": raw_text,
"preprocessed": preprocess_text,
"probability": predicted_sentiment_label,
"sentiment": predicted_sentiment_label_model[0],
"positive": f"{positive_percent:.2f}%",
"neutral": f"{neutral_percent:.2f}%",
"negative": f"{negative_percent:.2f}%"
}
return result
negative = "I recently bought shoes and they were a disaster. I hate them! They are making my feet hurt during the workout. I would never repurchase them or recommend them."
positive = "I recently got nike shoes, and it's been a game-changer for my workouts. It boasts outstanding durability, excellent performance, and superior comfort. It's a must-have for fitness enthusiasts."
neutral = "It is meh. I do not have an opinion."
prediction = analyze_sentiment(negative, clean_and_preprocess_text(negative),vect, voting_classifier)
prediction2 = analyze_sentiment(positive, clean_and_preprocess_text(positive), vect, voting_classifier)
prediction3 = analyze_sentiment(neutral, clean_and_preprocess_text(neutral), vect, voting_classifier)
print(prediction)
print(prediction2)
print(prediction3)
{'raw': 'I recently bought shoes and they were a disaster. I hate them! They are making my feet hurt during the workout. I would never repurchase them or recommend them.', 'preprocessed': 'recently bought shoes disaster hate making feet hurt workout would never purchase recommend', 'base_estimators': {'logistic_regression': {'probability': 'Negative', 'sentiment': 'Negative', 'positive': '0.00%', 'neutral': '0.00%', 'negative': '100.00%'}, 'naive_bayes': {'probability': 'Positive', 'sentiment': 'Positive', 'positive': '97.64%', 'neutral': '0.00%', 'negative': '2.36%'}, 'svm': {'probability': 'Negative', 'sentiment': 'Negative', 'positive': '0.26%', 'neutral': '0.32%', 'negative': '99.43%'}, 'random_forest': {'probability': 'Positive', 'sentiment': 'Positive', 'positive': '43.00%', 'neutral': '30.50%', 'negative': '26.50%'}}, 'positive': '35.22%', 'neutral': '7.70%', 'negative': '57.07%', 'majority_sentiment': 'Negative'}
{'raw': "I recently got nike shoes, and it's been a game-changer for my workouts. It boasts outstanding durability, excellent performance, and superior comfort. It's a must-have for fitness enthusiasts.", 'preprocessed': 'recently got nike shoes gamechanger workouts boasts outstanding durability excellent performance superior comfort musthave fitness enthusiasts', 'base_estimators': {'logistic_regression': {'probability': 'Positive', 'sentiment': 'Positive', 'positive': '100.00%', 'neutral': '0.00%', 'negative': '0.00%'}, 'naive_bayes': {'probability': 'Positive', 'sentiment': 'Positive', 'positive': '100.00%', 'neutral': '0.00%', 'negative': '0.00%'}, 'svm': {'probability': 'Positive', 'sentiment': 'Positive', 'positive': '99.90%', 'neutral': '0.00%', 'negative': '0.10%'}, 'random_forest': {'probability': 'Positive', 'sentiment': 'Positive', 'positive': '65.50%', 'neutral': '32.50%', 'negative': '2.00%'}}, 'positive': '91.35%', 'neutral': '8.13%', 'negative': '0.52%', 'majority_sentiment': 'Positive'}
{'raw': 'It is meh. I do not have an opinion.', 'preprocessed': 'meh opinion', 'base_estimators': {'logistic_regression': {'probability': 'Neutral', 'sentiment': 'Neutral', 'positive': '0.57%', 'neutral': '99.14%', 'negative': '0.29%'}, 'naive_bayes': {'probability': 'Positive', 'sentiment': 'Positive', 'positive': '52.74%', 'neutral': '24.03%', 'negative': '23.23%'}, 'svm': {'probability': 'Neutral', 'sentiment': 'Neutral', 'positive': '6.60%', 'neutral': '89.36%', 'negative': '4.04%'}, 'random_forest': {'probability': 'Neutral', 'sentiment': 'Neutral', 'positive': '7.00%', 'neutral': '88.50%', 'negative': '4.50%'}}, 'positive': '16.73%', 'neutral': '75.26%', 'negative': '8.02%', 'majority_sentiment': 'Neutral'}
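These averaged percentages feed the grading step mentioned in the introduction, where each product receives a percentage grade and an emoticon. A hypothetical sketch of that mapping (the thresholds and emoticons below are illustrative assumptions, not the project's actual rules):

```python
def grade_product(avg_positive_percent: float) -> str:
    """Map an averaged positive-sentiment percentage to a grade + emoticon.

    Thresholds are illustrative assumptions for this sketch.
    """
    if avg_positive_percent >= 75:
        return f"{avg_positive_percent:.0f}% 😀"
    if avg_positive_percent >= 40:
        return f"{avg_positive_percent:.0f}% 😐"
    return f"{avg_positive_percent:.0f}% 🙁"

print(grade_product(91.35))  # 91% 😀  (the positive review above)
print(grade_product(16.73))  # 17% 🙁  (the neutral review's positive share)
```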
🎬 Demonstration¶
This chapter is dedicated to the demonstration of the model to the stakeholders and my societal impact teacher. For the final demonstration, I decided on creating a full-stack application, consisting of a React frontend and a Flask backend.
- When?
The demonstration to the stakeholders and the societal impact teacher will happen on the 24th of November.
- Demonstration plan
Introduction: I will provide an overview of the end product and the features.
Demonstrating the app: I will demonstrate the app by inputting 2 new reviews and assessing them. The reviews are already prepared and were generated by ChatGPT.
Testing: I will let the audience test the app if they want to try their own sentences.
Q&A: I will answer any questions regarding the end product, the process of development or future improvements.
- Demonstration video
📝 Feedback¶
Stakeholders¶
- First Stakeholder Meeting (Manager at Sprinter)
I interviewed the manager and an end user during my proposal step. I had created questions curated to see if there is a need for such a project in the company.
The manager was really excited about the project and expressed an interest in incorporating AI into their company. Furthermore, they emphasized the need for an application that can help with assessing customer reviews.
The end user that I interviewed also showed interest in a tool for assessing customer reviews, as it would make online shopping more reliable and easy.
Overall, the feedback from the stakeholders confirmed the need for an AI tool in the company and showed interest in the project.
- First Feedback From Societal Impact Teacher (Nick Welman)
For the first version of the project, including the domain understanding, research and proposal, my teacher gave me valuable feedback for my research approach and domain understanding. My teacher was pleased with the results of the first version of the project and the steps that I have made so far.
- Second Stakeholder Meeting (Manager at Sprinter)
The stakeholders expressed satisfaction with the results I have achieved so far and the progress of the project. They were hoping to soon incorporate the project into their company.
- Second Feedback from Societal Impact Teacher (Nick Welman)
My teacher expressed satisfaction with the progress I have made. He was happy with the library research and the domain understanding, as well as the clear overview of the performance of the model.
- Third Stakeholder Meeting (Manager at Sprinter)
For this meeting, I explained to the stakeholders that the model's predictions are either Positive, Negative, or Neutral. They expressed satisfaction with the progress, all of the different data sources, that show diversity, and the different techniques that were incorporated. However, they gave me a suggestion to include a more comprehensive overview of the results, such as percentages for each of the classes.
- Third Feedback From Societal Impact Teacher (Nick Welman)
My teacher was happy with my progress and suggested that I create one more TICT scan and try to imagine all of the challenges the project could face if it were already deployed and used by the company. This advice made me think outside of the box and provided valuable insights for my project.
- Fourth Stakeholder Meeting - Demonstration (Manager at Sprinter)
The stakeholders were extremely happy that I have decided on creating a full-stack application. They were pleased with the end results and how the model was performing. Furthermore, they are excited to use the model within their company.
- Fourth Feedback from Societal Impact Teacher - Demonstration (Nick Welman)
My teacher was extremely happy with my demonstration using the ChatGPT-generated reviews. He thought that I did exactly what I was supposed to do with my project.
🤝 Conclusion¶
In conclusion, I would say that I learned a lot during the implementation of this project. I went through the AI Project Methodology multiple times, and every single time I learned new techniques and made a lot of progress. I based all of my actions on proper research and domain understanding. I followed my teachers' feedback and always wanted to improve. Every chapter in the notebook is properly explained and completed. One of my main goals for this project was transparency in the work that I do, and I think I have achieved it. Thanks to this challenge, I improved a lot in the field of AI and learned the importance of each step towards the end goal, such as Domain Understanding, Data Provisioning, Modeling and Optimization. I am satisfied with the end product as well, which is a full-stack application. In my opinion, I managed to satisfy the needs of my stakeholders, as well as accomplish my goals and follow all of the feedback provided by my teachers.